Method and device for optimizing a nucleotide sequence for the purpose of expression of a protein

ABSTRACT

The invention relates to a method for optimizing a nucleotide sequence for expression of a protein on the basis of the amino acid sequence of the protein, in which for a particular region there is specification of a test sequence with m optimization positions on which the codon occupation is varied, a quality function being used to ascertain the optimal codon occupation on these optimization positions, and one or more codons of this optimal occupation being specified as codons of the optimized nucleotide sequence. These steps are iterated, with the codons of the optimized nucleotide sequence which are specified in the preceding steps remaining unchanged in subsequent iteration steps. The invention additionally relates to a device for carrying out this method.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a CON of U.S. application Ser. No. 14/034,449 filed Sep. 23, 2013, which is a DIV of U.S. application Ser. No. 13/524,943 filed Jun. 15, 2012 (ABN), which is a CON of U.S. application Ser. No. 10/539,208 filed May 24, 2006 (U.S. Pat. No. 8,224,578), which is a U.S. National Phase application of INTL. Application No. PCT/EP2003/014850 filed Dec. 23, 2003 and claims priority to DE Application No. 10260805.9 filed Dec. 23, 2002, which disclosures are herein incorporated by reference in their entirety.

BACKGROUND

The invention relates generally to the production of synthetic DNA sequences and to the use thereof for producing proteins by introducing these DNA sequences into an expression system, for example into a host organism/a host cell or a system for in vitro expression, any of which expresses the appropriate protein. It relates in particular to methods in which a synthetic nucleotide sequence is optimized for the particular expression system, that is to say for example for an organism/for a host cell, with the aid of a computer.

One technique for the preparation and synthesis of proteins is the cloning and expression of the gene sequence corresponding to the protein in heterologous systems, e.g. Escherichia coli or yeast. Naturally occurring genes are, however, frequently suboptimal for this purpose. Since in a DNA sequence expressing a protein in each case one triplet of bases (codon) expresses one amino acid, it is possible for an artificial DNA sequence for expression of the desired protein to be synthesized and to be used for cloning and expression of the protein. One problem with this procedure is that a predefined amino acid sequence does not correspond to a unique nucleotide sequence. This is referred to as the degeneracy of the genetic code. The frequency with which different organisms use codons for expressing an amino acid differs (called the codon usage). There is ordinarily in a given organism one codon which is predominantly used and one or more codons which are used with comparatively low frequency by the organism for expressing the corresponding amino acid. Since the synthesized nucleotide sequence is to be used in a particular organism, the choice of the codons ought to be adapted to the codon usage of the appropriate organism. A further important variable is the GC content (content of the bases guanine and cytosine in a sequence). Further factors which may influence the result of expression are DNA motifs and repeats or inverse complementary repeats in the base sequence. Certain base sequences produce in a given organism certain functions which may not be desired within a coding sequence. Examples are cis-active sequence motifs such as splice sites or transcription terminators. The unintentional presence of a particular motif may reduce or entirely suppress expression or even have a toxic effect on the host organism. Sequence repeats may lead to lower genetic stability and impede the synthesis of repetitive segments owing to the risk of incorrect hybridizations. Inverse complementary repeats may lead to the formation of unwanted secondary structures at the RNA level or cruciform structures at the DNA level, which impede transcription and lead to genetic instability, or may have an adverse effect on translation efficiency.

A synthetic gene ought therefore to be optimized in relation to the codon usage and the GC content and, at the same time, substantially avoid the problems associated with DNA motifs and sequence repeats and inverse complementary sequence repeats. These requirements cannot, however, ordinarily be satisfied simultaneously and in an optimal manner. For example, optimization to optimal codon usage may lead to a highly repetitive sequence and a considerable difference from the desired GC content. The aim therefore is to reach the best possible compromise between satisfying the various requirements. However, the large number of amino acids in a protein leads to a combinatorial explosion of the number of possible DNA sequences which, in principle, are able to express the desired protein. For this reason, various computer-assisted methods have been proposed for ascertaining an optimal codon sequence.

P. S. Sarkar and Samir K. Brahmachari, Nucleic Acids Research 20 (1992) 5713 describe investigations into the role of the choice of codons in the formation of certain spatial structures of a DNA sequence. This involved generation of all the possible degenerate nucleotide sequences. Assessment of the sequences in relation to the presence of structural motifs and to structure-forming segments was performed by a computer using a knowledge base. The use of a quality function is not disclosed.

D. M. Hoover and J. Lubkowski, Nucleic Acids Research 30 (2002), No. 10, e43 propose a computer-assisted method in which the nucleotide sequence is divided into an odd number of segments for each of which a quality function (score) is calculated. The quality function includes inter alia the codon usage, the possibility of forming hairpin structures and the differences from the desired melting temperature. The value of the quality function for the complete sequence is determined from the total of the values of the quality function for the individual segments. The codon occupation within a segment is optimized by a so-called Monte-Carlo method. This entails random selection of codon positions in which the codon of an initial sequence is replaced by a randomly selected equivalent codon. At the same time, the limits of the segments are redefined in an iteration. In this way there is random generation of a complete gene sequence. If the value of the quality function for the complete sequence is less than that for the previous sequence, the new sequence is retained. If it is larger, the new sequence is retained with a certain probability, this probability being controlled by a Boltzmann statistic. If the sequence does not change during a predetermined number of iterations, this sequence is regarded as the optimal sequence.

Random methods of this type have the disadvantage that they depend greatly on the choice of the convergence criteria.

SUMMARY

It is the object of the invention to provide an alternative method for optimizing a nucleotide sequence for the expression of a protein on the basis of the amino acid sequence of the protein, which can be implemented with relatively little storage space and relatively little computing time on a computer, and which avoids in particular the disadvantages of the random methods.

This object is achieved according to the invention by a method for optimizing a nucleotide sequence for the expression of a protein on the basis of the amino acid sequence of the protein, which comprises the following steps carried out on a computer:

-   generation of a first test sequence of n codons which correspond to n consecutive amino acids in the protein sequence, where n is a natural number and is less than or equal to N, the number of amino acids in the protein sequence,
-   specification of m optimization positions in the test sequence which correspond to the position of m codons, in particular of m consecutive codons, at which the occupation by a codon, relative to the test sequence, is to be optimized, where m≦n and m<N,
-   generation of one or more further test sequences from the first test sequence by replacing at one or more of the m optimization positions a codon of the first test sequence by another codon which expresses the same amino acid,
-   assessment of each of the test sequences with a quality function and ascertaining the test sequence which is optimal in relation to the quality function,
-   specification of p codons of the optimal test sequence which are located at one of the m optimization positions, as result codons which form the codons of the optimized nucleotide sequence at the positions which correspond to the position of said p codons in the test sequence, where p is a natural number and p≦m,
-   iteration of the preceding steps, where in each iteration step the test sequence comprises the appropriate result codon at the positions which correspond to positions of specified result codons in the optimized nucleotide sequence, and the optimization positions are different from positions of result codons.

According to the preferred embodiment of the invention, the aforementioned steps are iterated until all the codons of the optimized nucleotide sequence have been specified, i.e. occupied by result codons.

Thus, the optimization according to the invention is carried out not on the sequence as a whole but successively on part regions. The p result codons specified as optimal in one iteration step are not changed again in the subsequent iteration steps and, on the contrary, are assumed to be given in the respective optimization steps. It is preferred for the number of result codons which are specified in this way for further iterations and are treated as predefined to be smaller than the number m of optimization positions at which the codons are varied in an iteration step. In at least the majority of iteration steps and, in a particular embodiment, in all iteration steps apart from the first, in turn m is smaller than the number of codons of the test sequence (n). This makes it possible to take account not only of local effects on the m varied positions, but also of wider-ranging correlations, e.g. in connection with the development of RNA secondary structures.
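The iteration described above can be illustrated by a minimal sketch. It assumes a variation window of m consecutive codons that directly follows the result codons, exhaustive variation of the window, and extension by p codons per step. The codon table, the frequencies, the toy quality function and the names `CODONS`, `FREQ`, `quality` and `optimize` are hypothetical and serve only as an illustration; they are not taken from the description.

```python
from itertools import product

# Hypothetical toy data: a reduced codon table and made-up usage frequencies.
CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "L": ["CTG", "CTC", "TTA"],
    "E": ["GAA", "GAG"],
}
FREQ = {"ATG": 1.0, "AAA": 0.7, "AAG": 0.3, "CTG": 0.6,
        "CTC": 0.3, "TTA": 0.1, "GAA": 0.7, "GAG": 0.3}
AMINO_ACID = {c: aa for aa, cs in CODONS.items() for c in cs}


def quality(codons):
    """Toy quality function (larger is better): summed relative adaptiveness
    minus an arbitrary penalty for identical neighbouring codons."""
    usage = sum(FREQ[c] / max(FREQ[s] for s in CODONS[AMINO_ACID[c]])
                for c in codons)
    repeats = sum(1 for a, b in zip(codons, codons[1:]) if a == b)
    return usage - 0.5 * repeats


def optimize(protein, m=3, p=1):
    """Grow the optimized sequence from its start, fixing p codons per step."""
    result = []  # result codons; never changed once specified
    while len(result) < len(protein):
        window = protein[len(result):len(result) + m]  # optimization positions
        remaining = len(protein) - len(result)
        take = len(window) if remaining <= m else p    # final step: p = m
        # Test sequences: fixed result codons plus every window occupation.
        best = max(product(*(CODONS[aa] for aa in window)),
                   key=lambda occ: quality(result + list(occ)))
        result.extend(best[:take])
    return result
```

For example, `"".join(optimize("MKLE", m=2, p=2))` yields a 12-base sequence that encodes the same four amino acids. Because the quality function is evaluated on the result codons together with the window, correlations reaching back into the already specified part of the sequence are taken into account.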

According to the embodiments preferred at present, m is in the range from 3 to 20, preferably in the range from 5 to 10. With this choice of parameter it is possible to vary the codons with an acceptable usage of storage and computing time and, at the same time, achieve good optimization of the sequence.

According to one embodiment, m need not be the same in the various iteration steps but, on the contrary, may also be different in different iteration steps. It is also possible to provide for variation of the test sequence for different values of m to be carried out in one iteration step and, where appropriate, for taking account only of the optimization result for one value of m, in order to reduce influences of the quantity m on the optimization result, and in order to check whether an increase in the number m leads to a change in the result.

According to the preferred embodiment, the m optimization positions or at least some of them are connected and thus form a variation window, on which the codon occupation is varied, in the test sequence.

The invention can in particular provide for some of the m optimization positions on which the codons are varied to be identical in two or more consecutive iteration steps. If the m positions are connected, this means that the variation window in one iteration step overlaps with the variation window of a preceding iteration step.

The invention can provide for the m optimization positions of the test sequences in one or more iteration steps to follow directly one or more result codons which have been specified as part of the optimized nucleotide sequence.

The invention can likewise provide for the p codons which are specified as result codons of the optimized nucleotide sequence in one or more iteration steps to be p consecutive codons which preferably directly follow one or more result codons which have been specified as part of the optimized nucleotide sequence in an earlier step.

The invention can provide for the nucleotide sequence to be optimized from one of its ends. In particular, the invention can provide for an increase in each iteration step of the length of the test sequence of the previous iteration step by a particular number of codons, which may be different in different iterations, until n=N. If n=N and the number of positions in the test sequence not occupied by result codons is smaller than or equal to the value of m used in the preceding iterations, or if this number on use of different values of m in different iterations is in the region of the values of m in question, it is possible to set p=m in the corresponding iteration step, where m is at the same time the number of codons not yet specified. The occupation which is found to be optimal for the optimization positions is then accepted for the result codons at these optimization positions. This applies in particular when a test sequence is generated for every possible combination of occupations of the optimization positions.

However, it is also possible to provide for the region of the test sequence within the complete sequence in one iteration step not, or not completely, to include the region of a test sequence in a previous iteration step. For example, the test sequence itself may form a window on the complete sequence, e.g. a window of fixed length, which window is shifted on the complete sequence during the various iterations.

According to a preferred embodiment, the test sequence is extended after each step by p codons, it being possible in particular for m to be constant for all iteration steps.

In analogy to the embodiment of the invention described above, it is also possible to provide for the nucleotide sequence to be optimized from a site in its interior. This can take place for example in such a way that an initial test sequence corresponding to a region in the interior of the nucleotide sequence to be optimized is initially enlarged successively on one side until the end of the nucleotide sequence to be optimized or another predefined point is reached on the nucleotide sequence to be optimized, and then the test sequence is enlarged towards the other side until the other end of the nucleotide sequence to be optimized or another predetermined point is reached there on the nucleotide sequence to be optimized.

The invention can also provide for the test sequences in one iteration step to consist of an optimized or otherwise specified partial sequence of length q and two variation regions which are connected on both sides thereof and have a length of respectively m₁ and m₂ codons, where q+m₁+m₂=n. The occupation of the variation regions can be optimized for both variation regions together by simultaneously varying and optimizing the codons on the m₁ and m₂ locations. It is preferred in such a case for p₁ and p₂ codons in the first and second variation region, which are used as given basis for the further iteration, to be specified in each iteration step. However, it is also possible to provide for the two variation regions to be varied and optimized independently of one another. For example, it is possible to provide for the occupation to be varied in only one of the two variation regions, and for codons to be specified only in the one region, before the variation and optimization in the second region takes place. In this case, the p₁ specified codons in the first region are assumed as given in the optimization of the second region. This procedure is worthwhile when small correlations at the most are to be expected between the two regions.

According to this embodiment, it is possible to provide for the nucleotide sequence to be optimized starting from a point or a region in the interior of the sequence.

The invention can provide in particular for the region of the test sequence on the complete sequence in each iteration step to include the region of the test sequences in all the preceding iteration steps, and for the region of a test sequence in at least some of the preceding iteration steps to be located in each case in the interior or in each case at the border of the region of the test sequence in the current iteration step.

The invention can provide for the nucleotide sequence to be optimized independently on different part regions. The optimized nucleotide sequence can then be the combination of the different optimized partial sequences. It is also possible to provide for at least some of the respective result codons from two or more optimized part regions to be used as constituent of a test sequence in one or more iterations.

A preferred embodiment of the invention provides for test sequences with all possible codon occupations for the m optimization positions to be generated in one iteration step from the first test sequence, and the optimal test sequence to be ascertained from all possible test sequences in which a codon at one or more of the m optimization positions has been replaced by another codon which expresses the same amino acid.
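The number of test sequences generated in such an exhaustive iteration step is the product of the numbers of synonymous codons at the m optimization positions. The following sketch, given only as an illustration, computes this number using the degeneracies of the standard genetic code; the names `DEGENERACY` and `n_occupations` are hypothetical.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code.
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def n_occupations(window):
    """Number of distinct codon occupations of a variation window,
    given as a string of one-letter amino acid codes."""
    return prod(DEGENERACY[aa] for aa in window)
```

A window of ten leucine, serine or arginine residues already gives 6^10 = 60,466,176 occupations, whereas a window of five such residues gives 7,776. This suggests why a moderate window size keeps the exhaustive variation tractable.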

According to one embodiment of the invention, the quality function used to assess the test sequences is the same in all or at least the majority of the iterations. The invention may, however, also provide for different quality functions to be used in different iterations, for example depending on the length of the test sequences.

The method of the invention may comprise in particular the following steps:

-   assessment of each test sequence with a quality function,
-   ascertaining of an extreme value within the values of the quality function for all partial sequences generated in an iteration step,
-   specification of p codons of the test sequence which corresponds to the extreme value of the quality function as result codons at the appropriate positions, where p is a natural number and p≦m.

The quality function can be defined in such a way that either a larger value of the quality function means that the sequence is nearer the optimum, or a smaller value means that it is nearer the optimum. Correspondingly, the maximum or the minimum of the quality function among the generated codon sequences will be ascertained in the step of ascertaining the extreme value.

The invention can provide for the quality function to take account of one or more of the following criteria: codon usage for a predefined organism, GC content, sequence motifs, repetitive sequences, secondary structures, inverse repeats.

The invention can provide in particular for the quality function to take account of one or more of the following criteria:

cis-active sequence motifs, especially DNA/protein interaction binding sites and RNA/protein interaction binding sites, preferably splice motifs, transcription factor binding sites, transcription terminator binding sites, polyadenylation signals, endonuclease recognition sequences, immunomodulatory DNA motifs, ribosome binding sites, recognition sequences for recombination enzymes, recognition signals for DNA-modifying enzymes, recognition sequences for RNA-modifying enzymes, sequence motifs which are underrepresented in a predefined organism.

The invention can also provide for the quality function to take account of one or more of the following criteria:

-   exclusion or substantial exclusion of inverse complementary sequence identities of more than 20 nucleotides to the transcriptome of a predefined organism,
-   exclusion or substantial exclusion of homology regions of more than 1000 base pairs, preferably 500 base pairs, more preferably 100 base pairs, to a predefined DNA sequence, for example to the genome of a predefined organism or to the DNA sequence of a predefined vector construct.

The first of the two criteria relates to the exclusion of the mechanism known as RNA interference, with which an organism eliminates or deactivates RNA sequences with more than 20 nucleotides exactly identical to another RNA sequence. The intention of the second criterion is to prevent the occurrence of recombination, that is to say incorporation of the sequence into the genetic material of the organism, or mobilization of DNA sequences through recombination with other vectors. Both criteria can be used as absolute exclusion criteria, i.e. sequences for which one or both of these criteria are satisfied are not taken into account. The invention can also provide, as explained in more detail below in connection with sequence motifs, for these criteria to be assigned a weight which in terms of contribution is larger than the largest contribution of criteria which are not exclusion criteria to the quality function.

The invention can also, where appropriate together with other criteria, provide the criterion that no homology regions showing more than 90% similarity and/or 99% identity to a predefined DNA sequence, for example to the appropriate genome sequence of the predefined organism or to the DNA sequence of a predefined vector construct, are generated. This criterion can also be implemented either as absolute exclusion criterion or in such a way that it makes a very large contribution, outweighing the contribution of other criteria which are not exclusion criteria, to the quality function.

It is possible to provide in particular for the quality function to be a function of various single terms, in particular a total of single terms, which in each case assess one criterion from the following list of criteria: codon usage for a predefined organism, GC content, DNA motifs, repetitive sequences, secondary structures, inverse repeats.

Said function of single terms may be in particular a linear combination of single terms or a rational function of single terms. The criteria mentioned need not necessarily be taken completely into account in the weight function. It is also possible to use only some of the criteria in the weight function.

The various single terms in said function are called criterion weights hereinafter.

The invention can provide for the criterion weight relating to the codon usage (CU score) to be proportional to Σ_(i) f_(ci)/f_(cmaxi), where

-   f_(ci) is the frequency of the codon placed at site i of the test sequence for the relevant organism to express the amino acid at site i in the amino acid sequence of the protein to be expressed, and
-   f_(cmaxi) is the frequency of the codon which expresses most frequently the amino acid at site i in the corresponding organism.

The measure f_(ci)/f_(cmaxi) is known as the relative adaptiveness (cf. P. M. Sharp, W. H. Li, Nucleic Acids Research 15 (3) (1987), 1281 to 1295).

The local weight of the most frequently occurring codon is in this case, irrespective of the absolute frequency with which this codon occurs, set at a particular value, for example 1. This avoids the positions at which only a few codons are available for selection making a greater contribution to the total weight than those at which a larger number of codons are available for selection for expression of the amino acid. The index i may run over the entire n codons of the test sequence or a part thereof. In particular, it is possible to provide in one embodiment for i to run only over the m codons of the optimization positions.
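The sum Σ_(i) f_(ci)/f_(cmaxi) can be sketched as follows. The frequencies in `FREQ` are made-up illustrative values, not data for any real organism, and the names `relative_adaptiveness` and `cu_score` as well as the optional `positions` argument (standing for the sites over which the index i runs) are hypothetical.

```python
# Hypothetical codon frequencies for two amino acids (illustrative only).
FREQ = {
    "K": {"AAA": 0.74, "AAG": 0.26},
    "L": {"CTG": 0.50, "CTC": 0.10, "CTT": 0.10,
          "CTA": 0.04, "TTA": 0.13, "TTG": 0.13},
}


def relative_adaptiveness(aa, codon):
    """f_ci / f_cmaxi: frequency of the codon relative to the most
    frequent codon for the same amino acid."""
    f = FREQ[aa]
    return f[codon] / max(f.values())


def cu_score(amino_acids, codons, positions=None):
    """Sum of the relative adaptiveness over all sites, or only over the
    given positions (e.g. the m optimization positions)."""
    idx = range(len(codons)) if positions is None else positions
    return sum(relative_adaptiveness(amino_acids[i], codons[i]) for i in idx)
```

With these values, the most frequent codon at each site contributes exactly 1 regardless of how many synonymous codons exist, so that `cu_score("KL", ["AAA", "CTG"])` is 2.0.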

The invention can provide for the criterion weight relating to the codon usage to be used only for the m optimization positions.

It is possible to use instead of the relative adaptiveness also the so-called RSCU (relative synonymous codon usage; cf. P. M. Sharp, W. H. Li, loc. cit.). The RSCU for a codon position is defined by

RSCU_(ci) = f_(ci) d_(i)/(Σ_(c) f_(ci))

where the sum in the denominator runs over all the codons which express the amino acid at site i, and where d_(i) indicates the number of codons which express said amino acid. In order to define a criterion weight on the basis of the RSCU it is possible to provide for the RSCU to be summed for the respective test sequence over all the codons of the test sequence or a part thereof, in particular over the m codons of the optimization positions. The difference from the criterion weight derived from the relative adaptiveness is that with this weighting each codon position is weighted with the degree of degeneracy, d_(i), so that positions at which more codons are available for selection participate more in the criterion weight than positions at which only a few codons or even only a single codon are available for selection.
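The RSCU and the criterion weight summed from it can be sketched as follows. The codon counts in `COUNTS` are made-up illustrative values, and the names `rscu` and `rscu_score` are hypothetical.

```python
# Hypothetical codon counts (illustrative only): lysine has two synonymous
# codons, methionine only one.
COUNTS = {"K": {"AAA": 30, "AAG": 10}, "M": {"ATG": 25}}


def rscu(aa, codon):
    """RSCU = f_ci * d_i / (sum of f over all synonymous codons)."""
    f = COUNTS[aa]
    return f[codon] * len(f) / sum(f.values())


def rscu_score(amino_acids, codons):
    """Criterion weight: RSCU summed over the given codon positions."""
    return sum(rscu(aa, c) for aa, c in zip(amino_acids, codons))
```

With these counts the preferred lysine codon contributes 1.5, whereas the single methionine codon contributes only 1.0, which reflects the weighting with the degree of degeneracy d_(i).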

With the criterion weights described above for the codon usage, the arithmetic mean was formed over the local weights (relative adaptiveness, RSCU).

It can also be provided for the criterion weight relating to the codon usage to be proportional to the geometric mean of the local relative adaptiveness or the local RSCU, so that the following therefore applies

CUScore=K(Π_(i)RSCU_(i))^(1/L)

or

CUScore=K(Π_(i) f_(ci)/f_(cmaxi))^(1/L)

where K is a scaling factor, and L is the number of positions over which the product is formed. Once again, it is possible in this case to form the product over the complete test sequence or a part, in particular over the m optimization positions.
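The geometric mean can be evaluated via logarithms, which avoids numerical underflow of the product for large L. The following is a minimal sketch under the assumption that the local weights (relative adaptiveness or RSCU) are non-negative and that at least one position is given; the name `geometric_cu_score` is hypothetical.

```python
import math


def geometric_cu_score(local_weights, k=1.0):
    """K * (product of the local weights) ** (1 / L), computed via logs.

    local_weights is a non-empty list of non-negative local weights;
    a zero weight makes the whole product, and hence the score, zero.
    """
    if any(w <= 0 for w in local_weights):
        return 0.0
    L = len(local_weights)
    return k * math.exp(sum(math.log(w) for w in local_weights) / L)
```

Unlike the arithmetic mean, the geometric mean is pulled down strongly by a single rare codon: local weights of 1.0 and 0.25 give 0.5 instead of 0.625.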

In this connection, the invention also provides a method for optimizing a nucleotide sequence for expression of a protein on the basis of the amino acid sequence of the protein, which comprises the following steps carried out on a computer:

-   generation of one or more test sequences of n codons which correspond to n consecutive amino acids in the protein sequence, where n is a natural number less than or equal to N, the number of amino acids in the protein sequence,
-   assessment of the one or more test sequences on the basis of a quality function which comprises a geometric or arithmetic mean of the relative adaptiveness or of the RSCU over a number of L codon positions, where L is less than or equal to N,
-   generation of one or more new test sequences depending on the result of said assessment.

It is moreover possible for the generation of one or more new test sequences in the manner described above to take place in such a way that the new test sequences comprise a particular number of result codons specified on the basis of the preceding iterations but, for example, also in such a way that a particular test sequence is used with a particular probability, which depends on the value of the quality function, as basis for further iterations, in particular the further generation of test sequences, as is the case with Monte-Carlo methods.

Whereas the quality of a codon in the abovementioned methods is defined through the frequency of use in the transcriptome or a gene reference set of the expression organism, the quality of a particular codon can also alternatively be described by the biophysical properties of the codon itself. Thus, for example, it is known that codons with an average codon-anticodon binding energy are translated particularly efficiently.

It is therefore possible to use as measure of the translational efficiency of a test sequence for example the P2 index which indicates the ratio of the frequency of codons with average binding energy and codons with extremely strong or weak binding energy. It is also possible alternatively to utilize data obtained experimentally or by theoretical calculations for the translational efficiency or translation accuracy of a codon for the quality assessment. The abovementioned assessment criteria may be advantageous especially when the tRNA frequencies of the expression system need not be taken into account, because they can be specified by the experimenter as, for example, in in vitro translation systems.

The invention can provide for the criterion weight relating to the GC content (GCScore) to be a function of the absolute value of the difference between the ascertained GC content of the partial sequence, GCC, and the optimal GC content, GCC_(opt), where the GC content means the relative proportion of guanine and cytosine, for example in the form of a particular percentage proportion.

The criterion weight GCScore can have the following form, in particular:

GCScore=|GCC−GCC_(opt)|^(g) ·h

where

-   GCC is the actual GC content of the test sequence or of a predetermined part of the test sequence, GCC, or the average GC content of the test sequence or of a predetermined part of the test sequence, <GCC>,
-   GCC_(opt) is the desired (optimal) GC content,
-   g is a positive real number, preferably in the range from 1 to 3, in particular 1.3,
-   h is a positive real number.

The factor h is essentially a weighting factor which defines the relative weight of the criterion weight GCScore vis-à-vis the other criterion weights. Preferably, h is chosen so that the magnitude of the maximally achievable value of GCScore is in a range from one hundredth to one hundred times another criterion weight, in particular all criterion weights which represent no exclusion condition, such as, for example, the weights for a wanted or unwanted sequence motif.
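The criterion weight GCScore=|GCC−GCC_(opt)|^(g)·h can be sketched as follows. The GC content is expressed here as a fraction between 0 and 1, which is an assumption of this sketch, as are the default values and the names `gc_content` and `gc_score`. The value grows with the deviation from the optimum, so it would be subtracted when the quality function is maximized.

```python
def gc_content(seq):
    """Relative proportion of G and C in a non-empty base sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def gc_score(seq, gcc_opt=0.5, g=1.3, h=1.0):
    """GCScore = |GCC - GCC_opt| ** g * h for the given sequence."""
    return abs(gc_content(seq) - gcc_opt) ** g * h
```

For example, `gc_score("GCAT")` is 0.0 because the GC content equals the assumed optimum of 0.5, and `gc_score("GGCC", g=1.0, h=2.0)` is 1.0.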

To determine the average GC content it is possible to provide for a local GC content relating to a particular base position to be defined by the GC content on a window which has a particular size and which comprises this base and which, in particular, can be centered on this base. This local GC content is then averaged over the test sequence or a part region of the test sequence, in particular over the m optimization positions, it being possible to use both an arithmetic mean and a geometric mean here too. On use of an average GC content defined in this way there are fewer variations between test sequences differing in length n.
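The average GC content <GCC> can be sketched with a window centered on each base and an arithmetic mean. Truncating the window at the ends of the sequence is an assumption of this sketch, since the treatment of the ends is not specified above; the names `local_gc` and `mean_local_gc` and the default window size are hypothetical.

```python
def local_gc(seq, i, w):
    """GC content on a window of (odd) size w centered on base i.
    Assumption: the window is truncated at the ends of the sequence."""
    half = w // 2
    window = seq[max(0, i - half):min(len(seq), i + half + 1)]
    return sum(base in "GC" for base in window) / len(window)


def mean_local_gc(seq, w=5, positions=None):
    """Arithmetic mean of the local GC content over all base positions,
    or over the given positions only (e.g. those of the variation window)."""
    idx = range(len(seq)) if positions is None else positions
    values = [local_gc(seq, i, w) for i in idx]
    return sum(values) / len(values)
```

The result can be passed on in place of the actual GC content when computing GCScore.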

The invention can provide for the GC content to be ascertained over a window which is larger than the region of the m optimization positions and includes this. If the optimization positions form a coherent variation window it is possible to provide for b bases before and/or after the variation window to be included in the determination of the criterion weight for the GC content (GCScore), where b can be in a range from 15 to 45 bases (corresponding to 5 to 15 codons), preferably in a range from 20 to 30 bases.

The invention can further provide, inasmuch as the quality function is maximized, for a fixed amount to be subtracted for each occurrence of a sequence motif which is not permitted or is unwanted, and for a fixed amount to be added for each wanted or required motif, when ascertaining the value of the quality function (and vice versa for minimization of the quality function). This amount for unwanted or required motifs can be distinctly larger than all other criterion weights, so that the other criteria are unimportant compared therewith. An exclusion criterion is achieved thereby, while at the same time there is differentiation according to whether a motif has occurred once or more than once. However, it is likewise possible to define a worthwhile quality function and carry out an assessment of the test sequences with the quality function even if the condition relating to the sequence motif (non-presence of a particular motif/presence of a particular motif) cannot be satisfied for all test sequences produced in an iteration step. This will be the case in particular when the length n of the test sequences is relatively small compared with N, because a particular motif can often occur only when n is relatively large, because of the predefined amino acids of the protein sequence.
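For a quality function that is maximized, the motif contribution can be sketched as follows. Counting overlapping occurrences, the default amount of 1e6 and the names `count_motif` and `motif_score` are assumptions made for illustration only.

```python
def count_motif(seq, motif):
    """Number of (possibly overlapping) occurrences of motif in seq."""
    return sum(seq.startswith(motif, i)
               for i in range(len(seq) - len(motif) + 1))


def motif_score(seq, unwanted=(), wanted=(), amount=1e6):
    """Subtract a fixed amount per occurrence of an unwanted motif and
    add the same amount per occurrence of a wanted motif."""
    score = 0.0
    for motif in unwanted:
        score -= amount * count_motif(seq, motif)
    for motif in wanted:
        score += amount * count_motif(seq, motif)
    return score
```

If the amount is chosen far larger than every other criterion weight, a single unwanted motif dominates the quality function, yet a test sequence with one occurrence is still ranked above a test sequence with two. For example, with GAATTC (the EcoRI recognition sequence) as unwanted motif, `motif_score("GAATTCGAATTC", unwanted=["GAATTC"], amount=100.0)` is -200.0.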

The invention can further provide for the complete test sequence or partthereof to be checked for whether particular partial sequence segmentsor sequence segments similar to particular partial sequence segmentsoccur in another region of the test sequence or of a given region of thetest sequence or whether particular partial sequence segments orsequence segments similar to particular partial sequence segments occurin the inverse complementary test sequence or part of the inversecomplementary test sequence, and for a criterion weight for sequencerepeats (repeats) and/or inverse sequence repeats (inverse repeats) tobe calculated dependent thereon.

Ordinarily, the sequence will be checked not only for whether a particular sequence segment is present identically in the test sequence or the inverse complementary test sequence or in a part region thereof, but also for whether a similar, i.e. only partially matching, sequence is present in the test sequence or the inverse complementary test sequence or in a part thereof. Algorithms for finding global matches (global alignment algorithms) or local matches (local alignment algorithms) of two sequences are generally known in bioinformatics. Suitable methods include, for example, the dynamic programming algorithms generally known in bioinformatics, e.g. the so-called Needleman-Wunsch algorithm for global alignment and the Smith-Waterman algorithm for local alignment. In this regard, reference is made for example to Michael S. Waterman, Introduction to Computational Biology, London, New York 2000, especially pages 207 to 209 or Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge, 1999, especially pages 215 to 235.

The invention can in particular provide for every repeat of a partial sequence segment in another part of the test sequence or of a predefined region of the test sequence to be weighted with a particular weight which represents a measure of the degree of match and/or the size of the mutually similar segments, and for the weights of the individual repeats to be added to ascertain the criterion weight relating to the repeats or inverse complementary repeats. It is likewise possible to provide for the weights of the individual repeats to be exponentiated with a predefined exponent whose value is preferably between 1 and 2, and then for the summation to ascertain the criterion weight relating to the repeats or inverse complementary repeats to be carried out. It is moreover possible to provide for repeats below a certain length and/or repeats whose weight fraction is below a certain threshold not to be taken into account. The invention can provide, for the calculation of the appropriate criterion weight, for account to be taken only of the repeats or inverse complementary repeats of a partial sequence segment which is located in a predefined part region of the test sequence (test region), e.g. at its end and/or in a variation window. It is possible to provide for example for only the last 36 bases of the test sequence to be checked for whether a particular sequence segment within these 36 bases matches with another sequence segment of the complete test sequence or of the complete inverse complementary test sequence.

The invention can provide for only the segment or the M segments of the test sequence which provide the largest, or largest in terms of amount, contribution to the criterion weight, where M is a natural number, preferably between 1 and 10, to be taken into account in the criterion weights relating to repeats, inverse complementary repeats and/or DNA motifs.
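The combination of individual repeat weights described in the preceding paragraphs (thresholding, exponentiation, summation and optional restriction to the M largest contributions) could be sketched in Python as follows. The function name rep_score and its parameter names are assumptions for illustration; how the individual weights are obtained (e.g. from local alignments) is left open here.

```python
from typing import Iterable, Optional


def rep_score(weights: Iterable[float], exponent: float = 1.0,
              min_weight: float = 0.0,
              top_m: Optional[int] = None) -> float:
    """Combine individual repeat weights into one criterion weight.

    weights    -- one non-negative weight per repeat found
    exponent   -- each kept weight is raised to this power (e.g. 1 to 2)
    min_weight -- repeats with a smaller weight are ignored
    top_m      -- if given, only the top_m largest weights are used
    """
    kept = sorted((w for w in weights if w >= min_weight), reverse=True)
    if top_m is not None:
        kept = kept[:top_m]
    return sum(w ** exponent for w in kept)
```

An exponent above 1 penalizes one long repeat more strongly than several short repeats with the same summed weight.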

According to one embodiment of the invention, it is possible to provide for generation of a matrix whose number of columns corresponds to the number of positions of the region of the test sequence (test region) which is to be checked for repeats in other regions, and whose number of rows corresponds to the number of positions of the region of the test sequence with which comparison is intended (comparison region). Both the test region and the comparison region may include the complete test sequence.

The invention can further provide for the total weight function TotScore to be determined as follows:

TotScore=CUScore−GCScore−REPScore−SiteScore

where CUScore is the criterion weight for the codon usage, GCScore is the criterion weight for the GC content, REPScore is the criterion weight for repeats and inverse complementary repeats of identical or similar sequence segments, and SiteScore is the criterion weight for the occurrence of unwanted or required motifs.

The weight REPScore can, according to one embodiment of the invention, consist of a sum of two components, of which the first indicates the criterion weight for the repeat of identical or similar sequence segments in the test sequence itself or in a part region thereof, and the second component indicates the criterion weight for inverse complementary repeats of identical or similar sequence segments in the test sequence or in a part region thereof.

If the quality function is composed of portions of a plurality of test criteria, especially when the quality function consists of a linear combination of criterion weights, a test sequence need not necessarily be assessed according to all criteria in an iteration step. On the contrary, the assessment can be stopped as soon as it is evident that the value of the quality function is less or, speaking more generally, less optimal than the value of the quality function of a test sequence which has already been assessed. In the embodiments described previously, most of the criteria, such as the criterion weights for repetitive elements, motifs to be excluded etc., are included negatively in the quality function. If, after calculating the criterion weights which are included positively in the quality function and, where appropriate, some of the criterion weights which are included negatively in the quality function, the summation corresponding to the linear combination, defined by the quality function, of the appropriate previously calculated criterion weights gives a value which is smaller than a previously calculated value of the complete quality function for another test sequence, the currently assessed test sequence can be eliminated at once. It is likewise frequently possible, for example when a criterion weight is considerably larger in terms of amount than all the other weights, for the assessment to be stopped at once after ascertaining the corresponding criterion weight. If, for example, an unwanted motif has not appeared in a first test sequence, and the unwanted motif appears in a second test sequence, the second test sequence can be immediately excluded, because the criterion weight for the motif search is so large that it cannot be compensated by other criterion weights.

The invention can provide in particular in embodiments in which the quality function can be calculated iteratively for there to be, in at least one iteration, determination of an upper (or in the case of optimization to the minimum of the quality function lower) limit below (or above) which the value of the complete quality function lies, and for the iteration of the quality function to be stopped when this value is below (or above) the value which has previously been ascertained for the complete quality function for a test sequence.

The invention can provide in these cases for said upper or lower limit to be used if necessary as value of the quality function in the further method for this test sequence, and/or for the corresponding test sequence to be eliminated in the algorithm, for example through the variable for the optimized test sequence remaining occupied by a previously found test sequence for which the quality function has a higher value than the abovementioned limit, and for the algorithm to go on to the assessment of the next test sequence. The invention can moreover, especially when the quality function is a linear combination of criterion weights, provide for calculation in the first iterations of that contribution or those contributions whose highest value or whose minimal value has the highest absolute value.

The invention can provide in the case of a quality function which is optimized to its maximum and which is formed by a linear combination of criterion weights for firstly the positive portions of the linear combination to be calculated and the iteration to be stopped when, in one iteration after the calculation of all positive criterion weights, the value of the quality function in this iteration is smaller than the value of the complete quality function for another test sequence.

The invention can also provide for an iteration of the quality function to be stopped when it is found in an iteration that the sum of the value of the quality function calculated in this iteration and the maximum value of the contribution of the as yet uncalculated criterion weights is below the value of the complete quality function of another test sequence.
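The early termination described in the preceding paragraphs can be sketched in Python as follows, assuming a quality function of the form TotScore=CUScore−GCScore−REPScore−SiteScore that is maximized. It is assumed here that every subtracted criterion returns a non-negative value, so that the partial sum after each step is an upper limit for the complete quality function. The name bounded_score and the callable-based interface are illustrative assumptions.

```python
from typing import Callable, List, Optional

# A criterion maps a test sequence to its criterion weight.
Criterion = Callable[[str], float]


def bounded_score(seq: str, cu_score: Criterion,
                  penalties: List[Criterion],
                  best_so_far: float) -> Optional[float]:
    """Return the complete score, or None once seq can no longer win.

    cu_score  -- the positively included criterion, evaluated first
    penalties -- criteria that are subtracted; each is assumed to return
                 a value >= 0, so the partial sum never increases
    """
    partial = cu_score(seq)
    if partial < best_so_far:
        return None            # upper limit already below the best value
    for penalty in penalties:
        partial -= penalty(seq)
        if partial < best_so_far:
            return None        # remaining criteria need not be computed
    return partial
```

Placing the criterion with the largest possible contribution (for example the motif weight) first in the list of penalties tends to eliminate unsuitable test sequences with the fewest evaluations.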

The method of the invention may include the step of synthesizing the optimized nucleotide sequence.

It is possible to provide in this connection for the step of synthesizing the optimized nucleotide sequence to take place in a device for automatic synthesis of nucleotide sequences, for example in an oligonucleotide synthesizer, which is controlled by the computer which optimizes the nucleotide sequence.

The invention can provide in particular for the computer, as soon as the optimization process is complete, to transfer the ascertained data concerning the optimal nucleotide sequence to an oligonucleotide synthesizer and cause the latter to carry out the synthesis of the optimized nucleotide sequence.

This nucleotide sequence can then be prepared as desired. The protein is expressed by introducing the appropriate nucleotide sequence into host cells of a host organism for which it is optimized and which then eventually produces the protein.

The invention also provides a device for optimizing a nucleotide sequence for the expression of a protein on the basis of the amino acid sequence of the protein, which has a computer unit which comprises:

-   a unit for generation of a first test sequence of n codons which correspond to n consecutive amino acids in the protein sequence, where n is a natural number less than or equal to N, the number of amino acids in the protein sequence,
-   a unit for specification of m optimization positions in the test sequence which correspond to the position of m codons at which the occupation by a codon, relative to the test sequence, is to be optimized, where m≤n and m<N,
-   a unit for generation of one or more further test sequences from the first test sequence by replacing at one or more of the m optimization positions a codon of the first test sequence by another codon which expresses the same amino acid,
-   a unit for assessment of each of the test sequences with a quality function and for ascertaining the test sequence which is optimal in relation to the quality function,
-   a unit for specification of p codons of the optimal test sequence which are located at one of the m optimization positions, as result codons which form the codons of the optimized nucleotide sequence at the positions which correspond to the positions of said p codons in the test sequence, where p is a natural number and p≤m,
-   a unit for iteration of the steps of generation of a plurality of test sequences, of assessment of the test sequences and of specification of result codons, preferably until all the codons of the optimized nucleotide sequence have been specified, where in each iteration step the test sequence comprises the appropriate result codon at the positions which correspond to positions of specified result codons in the optimized nucleotide sequence, and the optimization positions are different from positions of result codons.

The aforementioned units need not be different but may, in particular, be implemented by a single device which implements the functions of the aforementioned units.

The device of the invention may generally have a unit for carrying out the steps of the methods described above. The device of the invention may have an oligonucleotide synthesizer which is controlled by the computer so that it synthesizes the optimized nucleotide sequence.

In this embodiment of the invention, the optimized nucleotide sequence can be synthesized either automatically or through an appropriate command from the user, without data transfers, adjustment of parameters and the like being necessary.

The invention also provides a computer program which comprises program code which can be executed by a computer and which, when it is executed on a computer, causes the computer to carry out a method of the invention.

The program code can moreover, when it is executed on a computer, cause a device for the automatic synthesis of nucleotide sequences to prepare the optimized nucleotide sequence.

The invention also provides a computer-readable data medium on which a program of the invention is stored in computer-readable form.

The invention further provides a nucleic acid which has been or can be prepared by a method of the invention, and a vector which comprises such a nucleic acid. The invention further provides a cell which comprises such a vector or such a nucleic acid, and a non-human organism or a non-human life form which comprises such a cell, it also being possible for such a non-human life form to be a mammal.

Whereas in random methods there is no correlation between a sequence in a preceding iteration step and the sequence in a subsequent iteration step, there is according to the invention new specification of a codon in each iteration step. Since the test sequence is varied on only part of the complete sequence, the method can be carried out with less effort. It is possible in particular to evaluate all possible combinations of codons in the variation region. The invention makes use in an advantageous manner of the circumstance that long-range correlations within a nucleotide sequence are of minor importance, i.e. that to achieve an acceptable optimization result it is possible to vary the codons at one position substantially independently of the codons at a more remote position.

The method of the invention makes it possible to a greater extent than previous methods for relevant biological criteria to be included in the assessment of a test sequence. For example, with the method of the invention it is possible to take account of wanted or unwanted motifs in the synthetic nucleotide sequence. Since in a motif search even an individual codon may be crucial for whether a particular motif is present or not, purely stochastic methods will provide optimized sequences which comprise a required motif only with a very low probability or not at all. However, this is possible with the method of the invention because all codon combinations are tested over a part region of the sequence. It is possible where appropriate in order to ensure the presence or non-presence of a particular sequence motif to make the number m of optimization positions so large that it is larger than the number of codon positions (or the number of base positions divided by 3) of the corresponding motif. If the m optimization positions are connected, it is thus ensured that the occurrence of a particular sequence motif can be reliably detected and the corresponding motif can be ensured in the sequence or excluded from the latter. The numerical calculation of the quality function has particular advantages on use of weight matrix scans. Since in this case a different level of importance for recognition or biological activity can be assigned to the different bases of a recognition sequence, it is possible in the method of the invention, in which all possible codon combinations are tested over a part region of the sequence, to find the sequence which, for example, switches off most effectively a DNA motif by eliminating the bases which are most important for the activity, or it is possible to find an optimized compromise solution with inclusion of other criteria.

The invention is not in principle restricted to a particular organism. Organisms for which an optimization of a nucleotide sequence for expression of a protein using the method of the invention is of particular interest are, for example, organisms from the following groups:

-   viruses, especially vaccinia viruses,
-   prokaryotes, especially Escherichia coli, Caulobacter crescentus, Bacillus subtilis, Mycobacterium spec.,
-   yeasts, especially Saccharomyces cerevisiae, Schizosaccharomyces pombe, Pichia pastoris, Pichia angusta,
-   insects, especially Spodoptera frugiperda, Drosophila spec.,
-   mammals, especially Homo sapiens, Macaca mulatta, Mus musculus, Bos taurus, Capra hircus, Ovis aries, Oryctolagus cuniculus, Rattus norvegicus, Chinese hamster ovary,
-   monocotyledonous plants, especially Oryza sativa, Zea mays, Triticum aestivum,
-   dicotyledonous plants, especially Glycine max, Gossypium hirsutum, Nicotiana tabacum, Arabidopsis thaliana, Solanum tuberosum.

Proteins for which an optimized nucleotide sequence can be generated using the method of the invention are, for example:

-   enzymes, especially polymerases, endonucleases, ligases, lipases, proteases, kinases, phosphatases, topoisomerases,
-   cytokines, chemokines, transcription factors, oncogenes,
-   proteins from thermophilic organisms, from cryophilic organisms, from halophilic organisms, from acidophilic organisms, from basophilic organisms,
-   proteins with repetitive sequence elements, especially structural proteins,
-   human antigens, especially tumor antigens, tumor markers, autoimmune antigens, diagnostic markers,
-   viral antigens, especially from HAV, HBV, HCV, HIV, SIV, FIV, HPV, rhinoviruses, influenza viruses, herpesviruses, polyomaviruses, hendra virus, dengue virus, AAV, adenoviruses, HTLV, RSV,
-   antigens of protozoa and/or disease-causing parasites, especially those causing malaria, leishmania, trypanosoma, toxoplasmas, amoeba,
-   antigens of disease-causing bacteria or bacterial pathogens, especially of the genera Chlamydia, staphylococci, Klebsiella, Streptococcus, Salmonella, Listeria, Borrelia, Escherichia coli,
-   antigens of organisms of safety level L4, especially Bacillus anthracis, Ebola virus, Marburg virus, poxviruses.

The preceding lists of organisms and proteins for which the invention can be used are by no means restrictive and are intended merely as examples for better illustration.

Further features and advantages of the invention are evident from the following description of exemplary embodiments of the invention with reference to the appended drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1a, 1b show a flow diagram of an exemplary embodiment of the method of the invention,

FIG. 2 illustrates the relationship between test sequence, optimized DNA sequence, combination DNA sequence and amino acid sequence for an exemplary embodiment of the invention,

FIG. 3 shows the regions for determining the sequence repeat,

FIGS. 4a and 4b show diagrammatically a scheme for determining sequence repeats,

FIG. 5a shows the codon usage on exclusive optimization for codon usage,

FIG. 5b shows the GC content on exclusive optimization for codon usage,

FIG. 6a shows the codon usage on use of a first quality function,

FIG. 6b shows the GC content on use of a first quality function,

FIG. 7a shows the codon usage on use of a second quality function,

FIG. 7b shows the GC content on use of a second quality function,

FIG. 8a shows the codon usage on use of a third quality function,

FIG. 8b shows the GC content on use of a third quality function,

FIG. 9 shows a representative murine MIP1alpha calibration line in connection with example 3,

FIG. 10 illustrates the percentage increase in the total amount of protein after transfection of synthetic expression constructs compared with wild-type expression constructs in connection with example 3,

FIG. 11 shows a representative ELISA analysis of the cell lysates and supernatants of transfected H1299 cells in connection with example 3 and

FIGS. 12A to 12C show the expression analysis of the synthetic reading frames and of the wild-type reading frames in connection with example 3.

DETAILED DESCRIPTION

According to a preferred embodiment of the invention, in one iteration the choice of the codon for the ith amino acid of an amino acid sequence of length N is considered. For this purpose, all possible codon combinations of the available codons for the amino acids at positions i to i+m−1 are formed. These positions form a variation window and specify the optimization positions at which the sequence is to be varied. Every combination of codons on this variation window results in a DNA sequence with 3m bases, which is called combination DNA sequence (CDS) hereinafter. In each iteration step, a test sequence which comprises the CDS at its end is formed for each CDS. In the first iteration step, the test sequences consist only of the combination DNA sequences. The test sequences are weighted with a quality function which is described in detail below, and the first codon of the CDS which exhibits the maximum value of the quality function is retained for all further iterations as codon of the optimized nucleotide sequence (result codon). This means that when the ith codon has been specified in an iteration, each of the test sequences comprises in the next iteration this codon at position i, and the codons of the various combination DNA sequences at positions i+1 to i+m. Thus, in the jth iteration, all test sequences consist at positions 1 to j−1 of the codons found to be optimal in the preceding iterations, while the codons at positions j to j+m−1 are varied. The quality of the DNA sequence can be expressed as criterion weight (individual score) for each individual test criterion. A total weight (total score) is formed by adding the criterion weights weighted according to specifications defined by the user and indicates the value of the quality function for the complete test sequence. If j=N−m+1, the optimal test sequence is at the same time the optimized nucleotide sequence according to the method of the invention. All the codons of the optimal CDS in this (last) step are therefore specified as codons of the optimized nucleotide sequence.

The procedure described above is illustrated diagrammatically in FIGS. 1a and 1b. The algorithm starts at the first amino acid (i=1). A first CDS of the codons for amino acids i to i+m−1 is then formed (in the first iteration, these are amino acids 1 to m). This CDS is combined with the previously optimized DNA sequence to give a test sequence. In the first step, the optimized DNA sequence consists of 0 elements. The test sequence therefore consists in the first iteration only of the previously formed (first) CDS.

The test sequence is then evaluated according to criteria defined by the user. The value of a quality function is calculated by criterion weights being calculated for various assessment criteria and being combined in an assessment function. If the value of the quality function is better than a stored value of the quality function, the new value of the quality function is stored. At the same time, the first codon of the relevant CDS which represents amino acid i is also stored. If the value of the quality function is worse than the stored value, no action is taken. The next step is to check whether all possible CDS have been formed. If this is not the case, the next possible CDS is formed and combined with the previously optimized DNA sequence to give a new test sequence. The steps of evaluating, determining a quality function and comparing the value of the quality function with a stored value are then repeated. If, on the other hand, all possible CDS have been formed, and if i≠N−m+1, the stored codon is attached at position i to the previously formed optimized DNA sequence. In the first iteration, the optimized DNA sequence is formed by putting the stored codon on position 1 of the optimized DNA sequence. The process is then repeated for the next amino acid (i+1). If, on the other hand, i=N−m+1, the complete CDS of the optimal test sequence is attached to the optimized DNA sequence previously formed, because it is already optimized in relation to the assessment criteria. Output of the optimized sequence then follows.
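The iteration described above can be summarized by the following minimal Python sketch. The truncated codon table CODONS, the function name optimize and the passing of the quality function as a callable are assumptions made for illustration; a practical implementation would use the complete genetic code and a quality function composed of the criterion weights described in this document. Positions are 0-based in the code, whereas the text counts from 1.

```python
from itertools import product
from typing import Callable, Dict, List

# Hypothetical, truncated codon table (illustration only).
CODONS: Dict[str, List[str]] = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
    "L": ["CTG", "CTC", "TTA"],
}


def optimize(protein: str, m: int, quality: Callable[[str], float],
             codons: Dict[str, List[str]] = CODONS) -> str:
    """Sliding-window codon optimization maximizing quality."""
    if not protein or m < 1:
        raise ValueError("protein must be non-empty and m must be >= 1")
    n_aa = len(protein)
    m = min(m, n_aa)
    optimized = ""                        # result codons specified so far
    for i in range(n_aa - m + 1):         # start of the variation window
        best_value, best_cds = None, ""
        choices = [codons[aa] for aa in protein[i:i + m]]
        for combo in product(*choices):   # every possible CDS
            cds = "".join(combo)
            value = quality(optimized + cds)   # test sequence ends in CDS
            if best_value is None or value > best_value:
                best_value, best_cds = value, cds
        if i == n_aa - m:
            optimized += best_cds         # last window: keep the whole CDS
        else:
            optimized += best_cds[:3]     # otherwise keep its first codon
    return optimized
```

As a usage example under the same assumptions, a toy quality function that simply counts G and C bases selects the GC-richest synonymous codons: optimize("MKFL", 2, lambda s: s.count("G") + s.count("C")) returns "ATGAAGTTCCTG".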

The relationship of the various regions is depicted diagrammatically in FIG. 2. The combination DNA sequence and the region of the previously specified optimized DNA sequence are evident.

The parameter m can be varied within wide limits, the aim being to maximize the number of varied codons for the purpose of the best possible optimization. A worthwhile optimization result can be achieved within an acceptable time with a size of the variation window of from m=5 to m=10 using the computers currently available.

Besides the individual weighting of the criterion weights, it is possible to define both the total weight and the criterion weights by suitable mathematical functions which are modified compared with the simple relations such as difference or proportion, e.g. by segmentally defined functions which define a threshold value, or nonlinear functions. The former is worthwhile for example in assessing repeats or inverse complementary repeats which are to be taken into account only above a certain size. The latter is worthwhile for example in assessing the codon usage or the GC content.

Various examples of weighting criteria which can be used according to the invention are explained below without the invention being restricted to these criteria or the weighting functions described below.

Adaptation of the codon usage of the synthetic gene to the codon usage of the host organism is one of the most important criteria in the optimization. It is necessary to take account in this case of the different degeneracy of the various codons (one-fold to six-fold). Quantities suitable for this purpose are, for example, the RSCU (relative synonymous codon usage) or relative frequencies (relative adaptiveness) which are standardized to the frequency of the codon most used by the organism (the codon used most thus has the codon usage of 1), cf. P. M. Sharp, W. H. Li, Nucleic Acids Research 15 (1987), 1281 to 1295.

To assess a test sequence in one embodiment of the invention, the average codon usage on the variation window is used.
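The two quantities mentioned above, and the averaging over the variation window, can be illustrated by the following Python sketch. The codon counts used in the usage note are invented for illustration and do not represent any organism; the function names are likewise assumptions. Each dictionary passed to the first two functions is assumed to contain the counts of all synonymous codons of one amino acid, with at least one non-zero count.

```python
from typing import Dict


def relative_adaptiveness(counts: Dict[str, int]) -> Dict[str, float]:
    """Codon frequency relative to the most used synonymous codon.

    The most used codon receives the value 1.
    """
    top = max(counts.values())
    return {codon: n / top for codon, n in counts.items()}


def rscu(counts: Dict[str, int]) -> Dict[str, float]:
    """Relative synonymous codon usage.

    Observed count divided by the count expected if all synonymous
    codons of the amino acid were used equally often.
    """
    expected = sum(counts.values()) / len(counts)
    return {codon: n / expected for codon, n in counts.items()}


def cu_score(cds: str, adaptiveness: Dict[str, float]) -> float:
    """Average codon usage value over the codons of a CDS."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return sum(adaptiveness[c] for c in codons) / len(codons)
```

For hypothetical lysine counts of 30 for AAA and 10 for AAG, relative_adaptiveness gives 1 for AAA and 1/3 for AAG, rscu gives 1.5 and 0.5, and the CDS "AAAAAG" has an average codon usage of 2/3.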

When assessing the GC content, a minimal difference in the average GC content from the predefined desired GC content is necessary. An additional aim should be to keep the variations in the GC content over the course of the sequence small.

To evaluate a test sequence, the average percentage GC content of that region of the test sequence which includes the CDS and bases which are located before the start of the CDS and whose number b is preferably between 20 and 30 bases is ascertained. The criterion weight is ascertained from the absolute value of the difference between the desired GC content and the GC content ascertained for the test sequence, it being possible for this absolute value to enter as argument into a nonlinear function, e.g. into an exponential function.

If the variation window has a width of more than 10 codon positions, variations in the GC content within the CDS may be important. In these cases, as explained above, the GC content for each base position is ascertained on a window which is aligned in a particular way in relation to the base position and may include a particular number of, for example 40, bases, and the absolute values of the difference between the desired GC content and the “local” GC content ascertained for each base position are summed. Division of the sum by the number of individual values ascertained results in the average difference from the desired GC content as criterion weight. In the procedure described above it is possible for the location of the window to be defined so that said base position is located for example at the edge or in the center of the window. An alternative possibility is to use as criterion the absolute amount of the difference between the actual GC content in the test sequence or on a part region thereof and the desired GC content, or the absolute amount of the difference between the average of the abovementioned “local” GC content over the test sequence or a part thereof and the desired GC content. In a further modification it is also possible to provide for the appropriate criterion weight to be proportional to the square of the difference between the actual GC content and the desired GC content, the square of the difference between the GC content averaged over the base positions and the desired GC content, or the average of the square of the differences between the local GC content and the desired GC content. The criterion weight for the GC content has the opposite sign to the criterion weight for the codon usage.
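The averaging of “local” GC deviations over all base positions can be sketched in Python as follows. The choice of a window centered on each base position, its truncation at the ends of the sequence and the optional power parameter (2 for the squared variant) are assumptions made for illustration.

```python
def local_gc_score(seq: str, target_gc: float = 0.5,
                   window: int = 40, power: int = 1) -> float:
    """Average deviation of the local GC content from target_gc.

    For each base position the GC content is measured on a window of
    up to `window` bases roughly centered on that position (truncated
    at the sequence ends). The absolute deviations, raised to `power`,
    are averaged over all positions.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    seq = seq.upper()
    half = window // 2
    deviations = []
    for pos in range(len(seq)):
        lo = max(0, pos - half)
        hi = min(len(seq), pos - half + window)   # always includes pos
        local = seq[lo:hi]
        gc = sum(base in "GC" for base in local) / len(local)
        deviations.append(abs(gc - target_gc) ** power)
    return sum(deviations) / len(deviations) if deviations else 0.0
```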

Local recognition sequences or biophysical characteristics play a crucial role in cell biology and molecular biology. Unintended generation of corresponding motifs inside the sequence of the synthesized gene may have unwanted effects. For example, the expression may be greatly reduced or entirely suppressed; an effect toxic for the host organism may also arise. It is therefore desirable in the optimization of the nucleotide sequence to preclude unintended generation of such motifs. In the simplest case, the recognition sequence can be represented by a well-characterized consensus sequence (e.g. restriction enzyme recognition sequence) using appropriate IUPAC base symbols. Carrying out a simple regular expression search within the test sequence results in the number of positions found for calculating the appropriate weight. If a certain number of imperfections (mismatches) is permitted, the number of imperfections in a recognized match must be taken into account when ascertaining the weight function, for example by the local weight for a base position being inversely proportional to the number of bases which are assigned to an IUPAC consensus symbol. However, in many cases the consensus sequence is not sufficiently clear (cf., for example, K. Quandt et al., Nucleic Acids Research 23 (1995), 4878). It is possible in such cases to have recourse to a matrix representation of the motifs or use other recognition methods, e.g. by means of neural networks.
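The simplest case, a regular expression search for a consensus sequence written in IUPAC base symbols, can be sketched in Python as follows. The function names are illustrative assumptions; mismatches and matrix representations are not covered by this sketch. A lookahead is used so that overlapping matches are also reported.

```python
import re
from typing import List

# IUPAC nucleotide symbols and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(consensus: str) -> "re.Pattern[str]":
    """Translate an IUPAC consensus sequence into a compiled regex."""
    body = "".join(IUPAC[symbol] for symbol in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_sites(seq: str, consensus: str) -> List[int]:
    """0-based start positions of all (also overlapping) matches."""
    pattern = iupac_to_regex(consensus)
    return [match.start() for match in pattern.finditer(seq.upper())]
```

For example, the consensus GTYRAC matches both GTTAAC and GTCGAC, so find_sites("GTTAACGTCGAC", "GTYRAC") returns [0, 6]; the length of the returned list is the number of positions found.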

In the preferred embodiment of the invention, a value between 0 and 1 which, in the ideal case, reflects the binding affinity of the (potential) site found or its biological activity or else its reliability of recognition is determined for each motif found. The criterion weight for DNA motifs is calculated by multiplying this value by a suitable weighting factor, and the individual values for each match found are added.

The weight for unwanted motifs is included with the opposite sign to that for the codon usage in the overall quality function.

It is possible in the same way to include in the weighting the presence of certain wanted DNA motifs, e.g. RE cleavage sites, certain enhancer sequences or immunostimulatory or immunosuppressive CpG motifs. The weight for wanted DNA motifs is included with the same sign as the weight for the codon usage in the overall assessment.

Highly repetitive sequence segments may, for example, lead to low genetic stability. The synthesis of repetitive segments is also made distinctly difficult because of the risk of faulty hybridization. According to the preferred embodiment of the invention, therefore, the assessment of a test sequence includes whether it comprises identical or mutually similar sequence segments at various points. The presence of corresponding segments can be established for example with the aid of a variant of a dynamic programming algorithm for generating a local alignment of the mutually similar sequence segments. It is important in this embodiment of the invention that the algorithm used generates a value which is suitable for quantitative description of the degree of matching and/or the length of the mutually similar sequence segments (alignment weight). For further details relating to a possible algorithm, reference is made to the abovementioned textbooks by Gusfield or Waterman and M. S. Waterman, M. Eggert, J. Mol. Biology, (1987) 197, 723 to 728.

To calculate the criterion weight relating to the repetitive elements, the individual weights of all the local alignments where the alignment weight exceeds a certain threshold value are summed. Addition of these individual weights gives the criterion weight which characterizes the repetitiveness of the test sequence.
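An alignment weight of the kind described above can be obtained, for example, with a Smith-Waterman style dynamic programming recursion. The Python sketch below returns only the weight of the best local alignment between two regions; the scoring values (match 1, mismatch −1, gap −2), the function names and the restriction to a comparison of the end of the test sequence with the preceding part are assumptions for illustration, not the specific variant referred to in the cited literature.

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Weight of the best local alignment of a and b (Smith-Waterman).

    Only two rows of the dynamic programming matrix are kept, since
    just the maximum cell value is needed here.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def end_repeat_weight(test_seq: str, region_len: int = 36) -> int:
    """Best alignment of the last region_len bases with the rest.

    Repeats lying entirely inside the end region are not detected by
    this simplified comparison.
    """
    return local_alignment_score(test_seq[-region_len:],
                                 test_seq[:-region_len])
```

Collecting all alignments above a threshold, rather than only the best one, would require a traceback with masking of already reported alignments, which is omitted here.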

In a modification of the embodiment described above, only one region at the end of the test sequence, which includes the variation window and a certain number of further bases, e.g. 20 to 30, is checked for whether a partial segment of the test sequence occurs in identical or similar form in this region or at another site of the test sequence. This is depicted diagrammatically in FIG. 3. The full line in the middle represents the complete test sequence. The upper line represents the CDS, while the lower region represents the comparison region of the test sequence, which is checked for matching sequence segments with the remainder of the test sequence. The checking of the test sequences for matching or similar segments of the comparison region (cf. FIG. 3) using the dynamic programming matrix technique is illustrated in FIGS. 4a and 4b. FIG. 4a shows the case where similar or matching sequence segments A and B are present in the comparison region itself. FIG. 4b shows the case where a sequence segment B in the comparison region matches or is similar to a sequence segment A outside the comparison region.

As an alternative to the summation of individual weights, it is also possible to provide for only the alignment which leads to the highest individual weight or, more generally, only the alignments with the m largest individual weights, to be taken into account.
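The two ways of combining individual alignment weights described above can be sketched as follows. The function name and parameters are hypothetical, and it is an assumption of this sketch that the threshold is also applied when only the m largest weights are kept.

```python
def repeat_criterion_weight(alignment_weights, threshold, top_m=None):
    """Combine individual local-alignment weights into one criterion weight.

    alignment_weights: individual weights of the local alignments found.
    threshold: only alignments whose weight exceeds this value count.
    top_m: if given, only the top_m largest remaining weights are summed
           (top_m=1 keeps only the highest individual weight).
    """
    selected = sorted((w for w in alignment_weights if w > threshold),
                      reverse=True)
    if top_m is not None:
        selected = selected[:top_m]
    return sum(selected)
```

For the weights 3, 12, 20 and 8 with a threshold of 5, the summed criterion weight is 40, whereas `top_m=1` gives 20.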

With the weighting described above it is possible to include both similar sequences which are present, for example, at the start and at the end of the test sequence, and so-called tandem repeats where the similar regions are both located at the end of the sequence.

Inverse complementary repeats can be treated in the same way as simple repeats. The potential formation of secondary structures at the RNA level or cruciform structures at the DNA level can be recognized on the test sequence by the presence of such inverse complementary repeats (inverse repeats). Cruciform structures at the DNA level may impede translation and lead to genetic instability. It is assumed that the formation of secondary structures at the RNA level has adverse effects on translation efficiency. In this connection, inverse repeats of particular importance are those which form hairpin loops or cruciform structures. Faulty hybridizations or hairpin loops may also have adverse effects in the synthesis of the former from oligonucleotides.

The checking for inverse complementary repeats in principle takes place in analogy to the checking for simple repeats. The test sequence or the comparison region of the test sequence is, however, compared with the inverse complementary sequence. In a refinement, the thermodynamic stability can be taken into account in the comparison (alignment), in the simplest case by using a scoring matrix. This involves, for example, giving higher weight to a CC or GG match, because the base pairing is more stable, than to a TT or AA match. Variable weighting for imperfections (mismatches) is also possible correspondingly. More specific weighting is possible by using nearest neighbor parameters for calculating the thermodynamic stability, although this makes the algorithm more complex. Concerning a possible algorithm, reference is made for example to L. Kaderali, A. Schliep, Bioinformatics 18 (10) 2002, 1340 to 1349.
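A minimal sketch of such a check is given below: the sequence is aligned locally against its own inverse complementary sequence, with a simple scoring scheme that rewards G/C matches more than A/T matches. The weights (3 and 2), the mismatch and gap penalties, and all function names are assumptions made for illustration; nearest neighbor parameters are not used here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Inverse complementary sequence of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def match_weight(base):
    # Assumed weights: G/C pairs are treated as more stable than A/T pairs.
    return 3 if base in "GC" else 2

def inverse_repeat_score(seq, mismatch=-4, gap=-4):
    """Best local alignment score of seq against its reverse complement."""
    rc = reverse_complement(seq)
    n = len(seq)
    best = 0
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            if seq[i - 1] == rc[j - 1]:
                diag = prev[j - 1] + match_weight(seq[i - 1])
            else:
                diag = prev[j - 1] + mismatch
            # Local alignment: scores never drop below zero.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

In this sketch the self-complementary sequence GGGCCC scores 18, AAATTT scores 12, and AAAAAA, which cannot pair with itself, scores 0.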

For all the assessment criteria, the invention can provide for the corresponding weighting function to be position-dependent. For example, a larger weight can be given to the generation of an RE cleavage sequence at a particular site, or a larger weight can be given to secondary structures at the 5′ end, because they show stronger inhibition there. It is likewise possible to take account of the codon context, i.e. the preceding or following codon(s). It is additionally possible to provide for certain codons whose use at the domain limits plays a role in cotranslational protein folding to make a contribution to the quality function, which contribution depends on whether this codon is nearer to the domain limit or not. Further criteria which may be included in the quality function are, for example, biophysical properties such as the rigidity or the curvature of the DNA sequence. Depending on the area of use it is also possible to include criteria which are associated with further DNA sequences. For example, it is crucial in the area of DNA vaccination that the sequences used for vaccination show no significant similarity to the pathogenic elements of the natural viral genome, in order to reliably preclude unwanted recombination events. In the same way, vectors used for gene therapy purposes ought to show minimal similarity to sequences of the human genome in order firstly to preclude homologous recombination into the human genome and secondly to avoid vital genes being selectively switched off in transcription through RNA interference phenomena (RNAi phenomena). The latter is also of general importance in the production of recombinant cell factories and, in particular, in transgenic organisms.

The various criterion weights for various criteria can according to the invention be included differently in the overall weight function. In this connection, the difference in the value of the quality function which can be maximally achieved through the corresponding criterion is important for the test sequence formed. However, a large proportion of certain criterion weights is attributable to DNA bases which cannot be changed by different CDS, such as, for example, the nucleotides in front of the CDS, which are also included in the calculation of the average GC content, and the nucleotides which are unaltered within synonymous codons. The individual weighting of a criterion vis-a-vis other criteria can therefore be made dependent on how greatly the quality of the test sequence differs from the target. It may be worthwhile, for further processing in the mathematical functions used to calculate the quality function, to split up the criterion weights into a part which is a measure of the portion of a criterion which is variable on use of different CDS, and a part which is a measure of the unaltered portions.

EXAMPLES

The embodiments of the invention which are described above are explained further below with reference to specific examples.

Example 1

The intention is to ascertain the optimal DNA sequence (SEQ ID NO: 9) pertaining to the (fictional) amino acid sequence AASeq1 (SEQ ID NO: 10) from below. A conventional back-translation with optimization for optimal codon usage serves as reference.

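| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AASeq1: | E | Q | F | I | I | K | N | M | F | I | I | K | N | A |
| ASSeq1: | GAA | CAG | TTT | ATT | ATT | AAA | AAC | ATG | TTT | ATT | ATT | AAA | AAC | GCG |
|  | GAG | CAA | TTC | ATC | ATC | AAG | AAT |  | TTC | ATC | ATC | AAG | AAT | GCC |
|  |  |  |  | ATA | ATA |  |  |  |  | ATA | ATA |  |  | GCA |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  | GCT |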

The optimization is based on the following criteria:

-   -   the codon usage is to be optimized to the codon usage of E. coli K12,
    -   the GC content is to be as close as possible to 50%,
    -   repetitions are to be excluded as far as possible,
    -   the Nla III recognition sequence CATG is to be excluded.

The assessment function used for the codon usage is the following function:

CUScore=<CU>

where <CU> in this example is the arithmetic mean of the relative adaptiveness over the codon positions of the test sequence.

To represent the codon usage of a codon, for better comparability of the codon quality of different amino acids, the best codon in each case for a particular amino acid is set equal to 100, and the worse codons are rescaled according to their tabulated percentage content. A CUScore of 100 therefore means that only the codons optimal for E. coli K12 are used.

The weight for the percentage GC content is calculated as follows:

GCScore=|<GC>−GC_(desire)|^(1.3)×0.8
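The GC weight above can be written as a small function. This is a sketch only: the function name is hypothetical, and it is an assumption that <GC> is averaged over the whole string passed in, since this example does not state the averaging region.

```python
def gc_score(seq, gc_desired=50.0, exponent=1.3, factor=0.8):
    """GCScore = |<GC> - GC_desire| ** 1.3 * 0.8, with <GC> in percent."""
    # Average percentage GC content of the given sequence region.
    gc_percent = 100.0 * sum(base in "GC" for base in seq) / len(seq)
    return abs(gc_percent - gc_desired) ** exponent * factor
```

A region with exactly 50% GC content, such as GCAT, receives a GCScore of 0; the score grows faster than linearly as the GC content moves away from the desired value.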

To ascertain the individual weights of the alignments (alignment score), an optimal local alignment of the test sequence with a part region of the test sequence which includes a maximum of the last 36 bases of the complete test sequence is generated, with exclusion of the identity alignment (alignment of the complete part region with itself) (cf. FIGS. 3, 4a, 4b).

The assessment parameters for a base position used in this case for calculating the dynamic programming matrix are:

-   -   Match=1;
    -   Mismatch=−2;
    -   Gap=−2.

The corresponding criterion weight is specified by a power of the optimal alignment score in the examined region of the test sequence:

-   -   REPScore=(Score_(alignment))^(1.3)
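A minimal sketch of this repeat check is given below, using the parameters Match=1, Mismatch=−2 and Gap=−2 from above. The function names are hypothetical, and the way the identity alignment is excluded is an assumption of this sketch: each base of the comparison region is only compared with bases located earlier in the test sequence, so no base is ever paired with itself.

```python
def best_repeat_alignment(seq, tail_len=36, match=1, mismatch=-2, gap=-2):
    """Highest local alignment score between the last tail_len bases of
    seq and the sequence located in front of each compared base."""
    n = len(seq)
    tail_start = max(0, n - tail_len)
    best = 0
    prev = [0] * (n + 1)              # DP row of the previous tail base
    for i in range(tail_start, n):    # i: position in the comparison region
        cur = [0] * (n + 1)
        for j in range(i):            # j < i: never pair a base with itself
            s = match if seq[i] == seq[j] else mismatch
            cur[j + 1] = max(0, prev[j] + s, prev[j + 1] + gap, cur[j] + gap)
            best = max(best, cur[j + 1])
        prev = cur
    return best

def rep_score(seq):
    """REPScore = (alignment score) ** 1.3."""
    return best_repeat_alignment(seq) ** 1.3

# The two 14-codon sequences discussed in this example.
CODON_USAGE_ONLY = "".join(
    "GAA CAG TTT ATT ATT AAA AAC ATG TTT ATT ATT AAA AAC GCG".split())
OPTIMIZED = "".join(
    "GAA CAG TTC ATC ATC AAA AAT ATG TTT ATT ATC AAG AAC GCG".split())
```

Under these assumptions, `best_repeat_alignment` yields 18 for the sequence optimized for codon usage only and 6 for the sequence optimized with the full quality function.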

A site score of 100 000 is allocated for each CATG sequence found.

The overall quality function TotScore is then given by:

TotScore=CUScore−GCScore−REPScore−SiteScore

The CDS length m is 3 codons (9 bases).
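The iteration over CDS windows can be sketched as follows. This is an illustrative sketch, not the implementation of the invention: the function `optimize`, the toy codon table and the usage values (AAC=100, AAT=80, ATG=100) are hypothetical, and the toy quality function only combines the mean codon usage of the CDS with the CATG site penalty of 100 000.

```python
from itertools import product

def optimize(protein, codons, quality, m=3):
    """Sliding-window optimization: evaluate every CDS of m codons appended
    to the already fixed sequence, then fix the first codon of the best CDS."""
    fixed = ""
    for start in range(len(protein)):
        window = protein[start:start + m]
        candidates = ("".join(c)
                      for c in product(*(codons[aa] for aa in window)))
        best = max(candidates, key=lambda cds: quality(fixed + cds, cds))
        if start + m >= len(protein):
            # The window has reached the end: accept the whole best CDS.
            return fixed + best
        fixed += best[:3]
    return fixed

# Hypothetical toy data, for illustration only.
CODONS = {"N": ["AAC", "AAT"], "M": ["ATG"]}
USAGE = {"AAC": 100, "AAT": 80, "ATG": 100}

def quality(test_seq, cds):
    # Mean usage value over the CDS codons minus the CATG site penalty.
    cu = sum(USAGE[cds[i:i + 3]] for i in range(0, len(cds), 3)) / (len(cds) // 3)
    return cu - 100_000 * test_seq.count("CATG")
```

With a window of two codons, `optimize("NM", CODONS, quality, m=2)` selects AATATG and avoids the CATG motif. With a window of one codon, AAC is fixed before the following ATG is seen, and the motif is formed; this illustrates why the window must be large enough to look ahead.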

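An optimization only for optimal codon usage results in the following sequence:

| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SEQ ID NO: 10 | E | Q | F | I | I | K | N | M | F | I | I | K | N | A |
| SEQ ID NO: 9 | GAA | CAG | TTT | ATT | ATT | AAA | AAC | ATG | TTT | ATT | ATT | AAA | AAC | GCG |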

It is characterized by the following properties:

-   -   highly repetitive, caused by the amino acid sequence F_I_I_K_N        (residues 3-7 and 9-13 of SEQ ID NO: 10) which appears twice        (the repetitive sequence (bases 7-21 and 25-39 of SEQ ID NO: 9)        with the highest score (18) is shown):

19 AACATGTTTATTATTAAAAAC
 2 AACA-GTTTATTATTAAAAAC

-   -   GC content: 21.4%
    -   the Nla III recognition sequence CATG is present
    -   average codon usage: 100

If the optimization is carried out according to the algorithm of the invention with the abovementioned assessment functions and parameters, the following DNA sequence is obtained:

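| Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SEQ ID NO: 10 | E | Q | F | I | I | K | N | M | F | I | I | K | N | A |
| SEQ ID NO: 9 | GAA | CAG | TTC | ATC | ATC | AAA | AAT | ATG | TTT | ATT | ATC | AAG | AAC | GCG |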

It is characterized by the following properties:

-   -   scarcely repetitive (the alignment shown below with the highest        contribution has a score of 6)

11 TCATCA
   ||||||
 8 TCATCA

-   -   GC content: 31.0%
    -   the Nla III recognition sequence CATG has been avoided
    -   average codon usage: 88

In the optimization result according to the invention, the codon optimal in relation to codon usage was not chosen at five amino acid positions. However, the sequence found represents an optimal balance of the various requirements in terms of codon usage, GC content and ideal sequence properties (avoidance of repetitions).

For the amino acids with the numbers 3, 4 and 5, the higher GC content of the codons which are worse in terms of codon usage is the reason for the choice. At position 6, however, on comparison of the codons AAA and AAG, the considerably better codon usage of the AAA codon is dominant, although choice of the AAG codon would lead to a better GC score. On formation of the CDS at base position 13, the codon AAC is preferred for amino acid No. 7 since, with a window size of 3 codons for the CDS, it is not yet evident that this choice will lead to the formation of the CATG DNA motif which is to be avoided (the genetic code is not degenerate for methionine, i.e. there is only one codon for expression of methionine). In the formation of the CDS at base position 16, however, this has been recognized and consequently the codon AAT is chosen. Besides codon usage and GC content, the avoidance of a repetitive DNA sequence also plays a crucial role in the choice of the codons for amino acids 9 to 13, because of the identical amino acid sequences of amino acids Nos. 3 to 7 and 9 to 13. For this reason, the codons TTT and ATT are preferred for amino acids 9 and 10, in contrast to previously (amino acids 3 and 4).

The following table illustrates the individual steps of the algorithm which have led to the optimization result indicated above. It enables the progress of the algorithm to be understood step by step. Moreover, all combination DNA sequences (CDS) formed by the software are listed in detail for each starting position.

The following information is given for each possible CDS:

-   -   the test sequence which was formed from each CDS and the previously optimized DNA sequence which is used for evaluating the CDS,
    -   the scores which were ascertained for codon usage, GC content, repetitiveness and DNA sites found (CU, GC, Rep, Site),
    -   the repetitive element with the highest alignment score ascertained for the particular test sequence,
    -   the total score ascertained.

The CDS are in this case arranged according to decreasing total score, i.e. the first codon of the first CDS shown is attached to the previously optimized DNA sequence. The CDS in the following table correspond to sequences in the attached Sequence Listing, as shown below:

Starting Amino Acid CDS Test Sequence 1 E SEQ ID NO: 11bases 1-9 of SEQ ID NO: 9 2 Q SEQ ID NO: 12 bases 1-12 of SEQ ID NO: 93 F SEQ ID NO: 13 bases 1-15 of SEQ ID NO: 9 4 I SEQ ID NO: 14bases 1-18 of SEQ ID NO: 9 5 I SEQ ID NO: 15 bases 1-21 of SEQ ID NO: 96 K SEQ ID NO: 16 bases 1-24 of SEQ ID NO: 9 7 N SEQ ID NO: 17bases 1-27 of SEQ ID NO: 9 8 M SEQ ID NO: 18 bases 1-30 of SEQ ID NO: 99 F SEQ ID NO: 19 bases 1-33 of SEQ ID NO: 9 10 I SEQ ID NO: 20bases 1-36 of SEQ ID NO: 9 11 I SEQ ID NO: 21 bases 1-39 of SEQ ID NO: 912 K SEQ ID NO: 22 bases 1-42 of SEQ ID NO: 9 CDS Total test sequence CUGC Site Rep Alignment Score CDS starting position 1 for amino acid 1 EGAACAGTTC  92  5      0  0.0 G  87.0 | GAACAGTTC G GAACAGTTT 100 19     0  0.0 TT  81.0 || GAACAGTTT TT GAGCAGTTT  82  5      0  0.0 AG 77.0 || GAGCAGTTT AG GAGCAGTTC  73  5      0  0.0 AG  68.0 || GAGCAGTTCAG GAACAATTC  76 19      0  0.0 AA  57.0 || GAACAATTC AA GAGCAATTC  58 5      0  0.0 G  53.0 | GAGCAATTC G GAACAATTT  85 38      0  0.0 AA 47.0 || GAACAATTT AA GAGCAATTT  66 19      0  0.0 TT  47.0 || GAGCAATTTTT CDS starting position 4 for amino acid 2 Q CAGTTCATC  86  8      0 0.0 CA  78.0 || GAACAGTTCATC CA CAGTTTATC  94 19      0  0.0 TT  75.0|| GAACAGTTTATC TT CAGTTCATT  92 19      0  0.0 CA  73.0 || GAACAGTTCATTCA CAGTTTATT 100 33      0  0.0 TT  67.0 || GAACAGTTTATT TT CAATTCATC 70 19      0  0.0 AA  51.0 || GAACAATTCATC AA CAATTTATC  79 33      0 0.0 AA  46.0 || GAACAATTTATC AA CAGTTCATA  63 19      0  0.0 CA  44.0|| GAACAGTTCATA CA CAATTCATT  76 33      0  0.0 ATT  43.0 |||GAACAATTCATT ATT CAGTTTATA  71 33      0  0.0 TT  38.0 || GAACAGTTTATATT CAATTTATT  85 48      0  0.0 ATT  37.0 ||| GAACAATTTATT ATT CAATTCATA 48 33      0  0.0 AA  15.0 || GAACAATTCATA AA CAATTTATA  56 48      0 0.0 AA   8.0 || GAACAATTTATA AACDS starting position 7 for amino acid 3 F TTCATCATC  80 10      0  0.0TCATC  70.0 ||||| GAACAGTTCATCATC TCATC TTTATCATC  88 19      0  0.0 ATC 69.0 ||| GAACAGTTTATCATC ATC TTCATTATC  86 19 
     0  0.0 CA  67.0 ||GAACAGTTCATTATC CA TTCATCATT  86 19      0  0.0 TCAT  67.0 ||||GAACAGTTCATCATT TCAT TTTATTATC  94 30      0  0.0 TTAT  64.0 ||||GAACAGTTTATTATC TTAT TTTATCATT  94 30      0  0.0 CA  64.0 ||GAACAGTTTATCATT CA TTCATTATT  92 30      0  0.0 ATT  62.0 |||GAACAGTTCATTATT ATT TTTATTATT 100 42      0  0.0 TTATT  58.0 |||||GAACAGTTTATTATT TTATT TTCATCATA  57 19      0  0.0 TCAT  38.0 ||||GAACAGTTCATCATA TCAT TTCATAATC  57 19      0  0.0 AA  38.0 ||GAACAGTTCATAATC AA TTTATCATA  65 30      0  0.0 CA  35.0 ||GAACAGTTTATCATA CA TTTATAATC  65 30      0  0.0 AA  35.0 ||GAACAGTTTATAATC AA TTCATTATA  63 30      0  0.0 CA  33.0 ||GAACAGTTCATTATA CA TTCATAATT  63 30      0  0.0 AA  33.0 ||GAACAGTTCATAATT AA TTTATTATA  71 42      0  0.0 TTAT  29.0 ||||GAACAGTTTATTATA TTAT TTTATAATT  71 42      0  0.0 AA  29.0 ||GAACAGTTTATAATT AA TTCATAATA  34 30      0  0.0 ATA   4.0 |||GAACAGTTCATAATA ATA TTTATAATA  43 42      0  0.0 ATA   1.0 |||GAACAGTTTATAATA ATA CDS starting position 10 for amino acid 4 IATCATCAAA  88 19      0  0.0 TCATCA  69.0 |||||| GAACAGTTCATCATCAAATCATCA ATTATCAAA  94 28      0  0.0 TCA  66.0 ||| GAACAGTTCATTATCAAA TCAATCATTAAA  94 28      0  0.0 TCAT  66.0 |||| GAACAGTTCATCATTAAA TCATATTATTAAA 100 38      0  0.0 ATTA  62.0 |||| GAACAGTTCATTATTAAA ATTAATCATCAAG  65 11      0  0.0 TCATCA  54.0 |||||| GAACAGTTCATCATCAAGTCATCA ATTATCAAG  71 19      0  0.0 TCA  52.0 ||| GAACAGTTCATTATCAAG TCAATCATTAAG  71 19      0  0.0 TCAT  52.0 |||| GAACAGTTCATCATTAAG TCATATTATTAAG  77 28      0  0.0 ATTA  49.0 |||| GAACAGTTCATTATTAAG ATTAATCATAAAA  65 28      0  0.0 TCAT  37.0 |||| GAACAGTTCATCATAAAA TCATATAATCAAA  65 28      0  0.0 TCA  37.0 ||| GAACAGTTCATAATCAAA TCAATTATAAAA  71 38      0  0.0 AAA  33.0 ||| GAACAGTTCATTATAAAA AAAATAATTAAA  71 38      0  0.0 TAA  33.0 ||| GAACAGTTCATAATTAAA TAAATCATAAAG  43 19      0  0.0 TCAT  24.0 |||| GAACAGTTCATCATAAAG TCATATAATCAAG  43 19      0  0.0 TCA  24.0 ||| GAACAGTTCATAATCAAG TCAATTATAAAG  49 28      0  0.0 AA  21.0 || 
GAACAGTTCATTATAAAG AA ATAATTAAG 49 28      0  0.0 TAA  21.0 ||| GAACAGTTCATAATTAAG TAA ATAATAAAA  43 38     0  0.0 ATAA   5.0 |||| GAACAGTTCATAATAAAA ATAA ATAATAAAG  20 28     0  0.0 ATAA  −8.0 |||| GAACAGTTCATAATAAAG ATAACDS starting position 13 for amino acid 5 I ATCAAAAAC  94 19      0  0.0TCATCA  75.0 |||||| GAACAGTTCATCATCAAAAAC TCATCA ATTAAAAAC 100 27      0 0.0 TCAT  73.0 |||| GAACAGTTCATCATTAAAAAC TCAT ATCAAAAAT  88 27      0 0.0 TCATCA  61.0 |||||| GAACAGTTCATCATCAAAAAT TCATCA ATTAAAAAT  94 35     0  0.0 TCAT  59.0 |||| GAACAGTTCATCATTAAAAAT TCAT ATTAAGAAC  77 19     0  0.0 GAAC  58.0 |||| GAACAGTTCATCATTAAGAAC GAAC ATCAAGAAC  71 13     0  0.0 TCATCA  58.0 |||||| GAACAGTTCATCATCAAGAAC TCATCA ATCAAGAAT 65 19      0  0.0 TCATCA  46.0 |||||| GAACAGTTCATCATCAAGAAT TCATCAATTAAGAAT  71 27      0  0.0 TCAT  44.0 |||| GAACAGTTCATCATTAAGAAT TCATATAAAAAAC  71 27      0  0.0 TCAT-A-AAAAA  44.0 |||| | |||||GAACAGTTCATCATAAAAAAC TCATCATAAAAA ATAAAAAAT  65 35      0  0.0TCAT-A-AAAAA  30.0 |||| | ||||| GAACAGTTCATCATAAAAAAT TCATCATAAAAAATAAAGAAC  49 19      0  0.0 GAAC  30.0 |||| GAACAGTTCATCATAAAGAAC GAACATAAAGAAT  43 27      0  0.0 TCAT  16.0 |||| GAACAGTTCATCATAAAGAAT TCATCDS starting position 16 for amino acid 6 K AAAAATATG  94 26      0  0.0TCATCA  68.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCA AAGAATATG  71 19     0  0.0 TCATCA  52.0 |||||| GAACAGTTCATCATCAAGAATATG TCATCAAAAAACATG 100 19 200000  0.0 TCATCA 919.0 ||||||GAACAGTTCATCATCAAAAACATG TCATCA AAGAACATG  77 13 200000  0.0 TCATCA936.0 |||||| GAACAGTTCATCATCAAGAACATG TCATCACDS starting position 19 for amino acid 7 N AATATGTTT  94 35      0  0.0TCATCA  59.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCA TTT AATATGTTC  86 28     0  0.0 TCATCA  58.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCA TTCAACATGTTT 100 28 200000  0.0 TCATCA 928.0 ||||||GAACAGTTCATCATCAAAAACATG TCATCA TTT AACATGTTC  92 21 200000  0.0AACATGTTC 929.0 |||| |||| GAACAGTTCATCATCAAAAACATG AACA-GTTC TTCCDS starting position 22 for amino acid 8 M ATGTTTATC  94 35   
   0  0.0TCATCA  59.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCA TTTATC ATGTTTATT 10042      0  0.0 TCATCA  58.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCATTTATT ATGTTCATT  92 35      0  0.0 GTTCAT  57.0 ||||||GAACAGTTCATCATCAAAAATATG GTTCAT TTCATT ATGTTCATC  86 28      0 12.5GTTCATC  45.0 ||||||| GAACAGTTCATCATCAAAAATATG GTTCATC TTCATC ATGTTTATA 71 42      0  0.0 TCATCA  29.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCATTTATA ATGTTCATA  63 35      0  0.0 GTTCAT  28.0 ||||||GAACAGTTCATCATCAAAAATATG GTTCAT TTCATACDS starting position 25 for amino acid 9 F TTTATTATC  94 42      0  0.0TCATCA  52.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCA TTTATTATC TTTATCATT 94 42      0  0.0 TCATCA  52.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCATTTATCATT TTCATTATT  92 42      0  0.0 GTTCAT  50.0 ||||||GAACAGTTCATCATCAAAAATATG GTTCAT TTCATTATT TTTATCATC  88 35      0 12.5GTTTATCATC  40.0 ||| |||||| GAACAGTTCATCATCAAAAATATG GTTCATCATCTTTATCATC TTTATTATT 100 49      0 12.5 TCATCA--AAAATATGTTTATTATT  38.0||||||  |||| || | | ||||| GAACAGTTCATCATCAAAAATATGTCATCATCAAAA-ATATGT-TTATT TTTATTATT TTCATTATC  86 35      0 12.5GTTCATTATC  38.0 |||||| ||| GAACAGTTCATCATCAAAAATATG GTTCATCATCTTCATTATC TTCATCATT  86 35      0 17.4 GTTCATCAT  34.0 |||||||||GAACAGTTCATCATCAAAAATATG GTTCATCAT TTCATCATT TTCATCATC  80 28      020.0 GTTCATCATC  32.0 |||||||||| GAACAGTTCATCATCAAAAATATG GTTCATCATCTTCATCATC TTTATCATA  65 42      0  0.0 TCATCA  23.0 ||||||GAACAGTTCATCATCAAAAATATG TCATCA TTTATCATA TTTATAATC  65 42      0  0.0TCATCA  23.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCA TTTATAATC TTTATTATA 71 49      0  0.0 TCATCA  22.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCATTTATTATA TTTATAATT  71 49      0  0.0 TCATCA  22.0 ||||||GAACAGTTCATCATCAAAAATATG TCATCA TTTATAATT TTCATAATT  63 42      0  0.0GTTCAT  21.0 |||||| GAACAGTTCATCATCAAAAATATG GTTCAT TTCATAATT TTCATTATA 63 42      0  0.0 GTTCAT  21.0 |||||| GAACAGTTCATCATCAAAAATATG GTTCATTTCATTATA TTCATAATC  57 35      0 12.5 GTTCATAATC   9.0 |||||| |||GAACAGTTCATCATCAAAAATATG GTTCATCATC 
TTCATAATC TTCATCATA  57 35      017.4 GTTCATCAT   5.0 ||||||||| GAACAGTTCATCATCAAAAATATG GTTCATCATTTCATCATA TTTATAATA  43 49      0  0.0 TCATCA  −6.0 ||||||GAACAGTTCATCATCAAAAATATG TCATCA TTTATAATA TTCATAATA  34 42      0  0.0GTTCAT  −8.0 |||||| GAACAGTTCATCATCAAAAATATG GTTCAT TTCATAATACDS starting position 28 for amino acid 10 I ATTATCAAA  94 49      012.5 GTTTATTATCAAA  32.0 ||| || |||||| GAACAGTTCATCATCAAAAATATGGTTCATCATCAAA TTTATTATCAAA ATCATTAAA  94 49      0 12.5 GTTTATCATTAAA 32.0 ||| ||||| ||| GAACAGTTCATCATCAAAAATATG GTTCATCATCAAA TTTATCATTAAAATTATCAAG  71 42      0  0.0 TCATCA  29.0 ||||||GAACAGTTCATCATCAAAAATATG TCATCA TTTATTATCAAG ATCATTAAG  71 42      0 0.0 TCATCA  29.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCA TTTATCATTAAGATTATTAAA 100 57      0 14.9 TCATCA--AAAATATGTTTATTATTA  28.0 |||||| |||| || | | |||||| GAACAGTTCATCATCAAAAATATG TCATCATCAAAA-ATATGT-TTATTATTTATTATTAAA ATCATCAAA  88 42      0 20.0 GTTTATCATCAAA  26.0 |||||||||||| GAACAGTTCATCATCAAAAATATG GTTCATCATCAAA TTTATCATCAAA ATTATAAAA 71 57      0  0.0 TCATCA  14.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCATTTATTATAAAA ATAATTAAA  71 57      0  0.0 TCATCA  14.0 ||||||GAACAGTTCATCATCAAAAATATG TCATCA TTTATAATTAAA ATTATTAAG  77 49      014.9 TCATCA--AAAATATGTTTATTATTA  13.0 ||||||  |||| || | | ||||||GAACAGTTCATCATCAAAAATATG TCATCATCAAAA-ATATGT-TTATTA TTTATTATTAAGATCATCAAG  65 35      0 17.4 GTTTATCATCAA  13.0 ||| ||||||||GAACAGTTCATCATCAAAAATATG GTTCATCATCAA TTTATCATCAAG ATAATCAAA  65 49     0 12.5 GTTTATAATCAAA   3.0 ||| || |||||| GAACAGTTCATCATCAAAAATATGGTTCATCATCAAA TTTATAATCAAA ATCATAAAA  65 49      0 14.9 GTTTATCAT-AAAA  1.0 ||| ||||| |||| GAACACTTCATCATCAAAAATATG GTTCATCATCAAAATTTATCATAAAA ATAATCAAG  43 42      0  0.0 TCATCA   1.0 ||||||GAACAGTTCATCATCAAAAATATG TCATCA TTTATAATCAAG ATTATAAAG  49 49      0 0.0 TCATCA   0.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCA TTTATTATAAAGATAATTAAG  49 49      0  0.0 TCATCA   0.0 ||||||GAACAGTTCATCATCAAAAATATG TCATCA TTTATAATTAAG ATCATAAAG  43 42      012.5 
GTTTATCAT-AAA −12.0 ||| ||||| ||| GAACAGTTCATCATCAAAAATATGGTTCATCATCAAA TTTATCATAAAG ATAATAAAA  43 57      0  0.0 TCATCA −14.0|||||| GAACAGTTCATCATCAAAAATATG TCATCA TTTATAATAAAA ATAATAAAG  20 49     0  0.0 TCATCA −29.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCATTTATAATAAAG CDS starting position 31 for amino acid 11 I ATCAAGAAC  7142      0  0.0 TCATCA  29.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCATTTATTATCAAGAAC ATTAAAAAC 100 57      0 14.9 TCATCA--AAAATATGTTTATTATTA 28.0 ||||||  |||| || | | |||||| GAACAGTTCATCATCAAAAATATGTCATCATCAAAA-ATATGT-TTATTA TTTATTATTAAAAAC ATCAAAAAC  94 49      0 17.4GTTTATTATCAAAAA  28.0 ||| || |||||||| GAACAGTTCATCATCAAAAATATGGTTCATCATCAAAAA TTTATTATCAAAAAC ATTAAAAAT  94 64      0 14.9TCATCA--AAAATATGTTTATTATTA  15.0 ||||||  |||| || | | ||||||GAACAGTTCATCATCAAAAATATG TCATCATCAAAA-ATATGT-TTATTA TTTATTATTAAAAATATTAAGAAC  77 49      0 14.9 TCATCA--AAAATATGTTTATTATTA  13.0 |||||| |||| || | | |||||| GAACAGTTCATCATCAAAAATATG TCATCATCAAAA-ATATGT-TTATTATTTATTATTAAGAAC ATCAAAAAT  88 57      0 20.0 GTTTATTATCAAAAAT  11.0 ||||| ||||||||| GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAAT TTTATTATCAAAAATATCAAGAAT  65 49      0 12.5 GTTTATTATCAAGAAT   3.0 ||| || ||||| |||GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAAT TTTATTATCAAGAAT ATAAAGAAC  4949      0  0.0 TCATCA   0.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCATTTATTATAAAGAAC ATTAAGAAT  71 57      0 14.9 TCATCA--AAAATATGTTTATTATTA −1.0 ||||||  |||| || | | |||||| GAACAGTTCATCATCAAAAATATGTCATCATCAAAA-ATATGT-TTATTA TTTATTATTAAGAAT ATAAAAAAC  71 57      0 14.9TCATCA--AAAATATGTTTATTA-TA-AAAAA  −1.0 ||||||  |||| || | | ||| || |||||GAACAGTTCATCATCAAAAATATG TCATCATCAAAA-ATATGT-TTATTATAAAAATTTATTATAAAAAAC ATAAAAAAT  65 64      0 14.9TCATCA--AAAATATGTTTATTA-TA-AAAAA −14.0 ||||||  |||| || | | ||| || |||||GAACAGTTCATCATCAAAAATATG TCATCATCAAAA-ATATGT-TTATTATAAAAATTTATTATAAAAAAT ATAAAGAAT  43 57      0  0.0 TCATCA −14.0 ||||||GAACAGTTCATCATCAAAAATATG TCATCA TTTATTATAAAGAATCDS starting position 34 for amino acid 12 K AAGAACGCG  77 28   
   0 0.0 TCATCA  49.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCATTTATTATCAAGAACGCG AAAAACGCG 100 35      0 17.4 GTTTATTATCAAAAA  48.0||| || |||||||| GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAATTTATTATCAAAAACGCG AAGAACGCC  69 28      0  0.0 TCATCA  41.0 ||||||GAACAGTICATCATCAAAAATATG TCATCA TTTATTATCAAGAACGCC AAAAACGCC  92 35     0 17.4 GTTTATTATCAAAAA  40.0 ||| || ||||||||GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAA TTTATTATCAAAAACGCC AAAAATGCG 94 42      0 20.0 GTTTATTATCAAAAAT  32.0 ||| || |||||||||GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAAT TTTATTATCAAAAATGCG AAGAACGCA 63 35      0  0.0 TCATCA  28.0 |||||| GAACAGTTCATCATCAAAAATATG TCATCATTTATTATCAAGAACGCA AAAAACCCA  86 42      0 17.4 GTTTATTATCAAAAA  27.0||| || |||||||| GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAATTTATTATCAAAAACGCA AAAAATGCC  86 42      0 20.0 GTTTATTATCAAAAAT  24.0||| || ||||||||| GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAATTTTATTATCAAAAATGCC AAGAACGCT  59 35      0  0.0 TCATCA  24.0 ||||||GAACAGTTCATCATCAAAAATATG TCATCA TTTATTATCAAGAACGCT AAGAATGCG  71 35     0 12.5 GTTTATTATCAAGAAT  23.0 ||| || ||||| |||GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAAT TTTATTATCAAGAATGCG AAAAACGCT 81 42      0 17.4 GTTTATTATCAAAAA  22.0 ||| || ||||||||GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAA TTTATTATCAAAAACGCT AAGAATGCC 63 35      0 12.5 GTTTATTATCAAGAAT  15.0 ||| || ||||| |||GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAAT TTTATTATCAAGAATGCC AAAAATGCA 80 49      0 20.0 GTTTATTATCAAAAAT  11.0 ||| || |||||||||GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAAT TTTATTATCAAAAATGCA AAAAATGCT 75 49      0 20.0 GTTTATTATCAAAAAT   6.0 ||| || |||||||||GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAAT TTTATTATCAAAAATGCT AAGAATGCA 57 42      0 12.5 GTTTATTATCAAGAAT   2.0 ||| || ||||| |||GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAAT TTTATTATCAAGAATGCT AAGAATGCT 53 42      0 12.5 GTTTATTATCAAGAAT  −2.0 ||| || ||||| |||GAACAGTTCATCATCAAAAATATG GTTCATCATCAAAAAT TTTATTATCAAGAATGCT

Example 2

This example considers the optimization of GFP for expression in E. coli.

Origin of the amino acid sequence (SEQ ID NO: 23):

DEFINITION Aequorea victoria green-fluorescent protein mRNA, complete cds. ACCESSION M62654

MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFS RYPDHMKQHDFFKSAMPEGYVQERTIFYKDDGNYKSRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKMEYNYNS HNVYlMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRoHMILLE FVT AAGI THGMDELYK

Codon usage table used: Escherichia coli K12

Origin: online codon usage database

The meanings below are:

-   <CU>: average renormalized codon usage of the CDS (15 bases long)
-   <GC>: average percentage GC content of the last 35 bases of the test sequence
-   GC_(desire): desired GC content

The size of the window on which the GC content was calculated for the graphical representation in FIGS. 5b to 8b was 40 bases.

FIGS. 5a and 5b show the results for the quality function:

Score=<CU>

FIGS. 6a and 6b show the results for the quality function

Score=<CU>−|<GC>−GC_(desire)|^(1.3)×0.8

FIGS. 7a and 7b show the results for the quality function

Score=<CU>−|<GC>−GC_(desire)|^(1.3)×1.5

FIGS. 8a and 8b show the results for the quality function

Score=<CU>−|<GC>−GC_(desire)|^(1.3)×5

FIGS. 5 to 8 illustrate the influence of the different weighting of two optimization criteria on the optimization result. The aim is to smooth the GC content distribution over the sequence and approach the value of 50%. In the case shown in FIGS. 5a and 5b, optimization was only for optimal codon usage, resulting in a very heterogeneous GC distribution which in some cases differed greatly from the target content. In the case of FIGS. 6a and 6b there is an ideal conjunction of a smoothing of the GC content to a value around 50% with a good to very good codon usage. The cases of FIGS. 7a and 7b, and 8a and 8b, finally illustrate that although a further GC content optimization is possible, it is necessarily at the expense of a poor codon usage in places.
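The effect of the GC weighting factor on the choice between two candidates can be sketched numerically. The two candidate windows below (<CU> = 100 at 40% GC, and <CU> = 90 at 50% GC) are hypothetical values chosen for illustration, and the factor 0 stands for the quality function Score=<CU> of FIGS. 5a and 5b.

```python
def score(cu, gc, gc_desired=50.0, weight=0.8, exponent=1.3):
    """Score = <CU> - |<GC> - GC_desire| ** 1.3 * weight."""
    return cu - abs(gc - gc_desired) ** exponent * weight

# Hypothetical candidates: (<CU>, <GC> in percent).
CANDIDATES = {"codon-optimal": (100.0, 40.0), "GC-balanced": (90.0, 50.0)}

def preferred(weight):
    """Name of the candidate with the higher score for a given GC weight."""
    return max(CANDIDATES, key=lambda name: score(*CANDIDATES[name], weight=weight))

for w in (0.0, 0.8, 1.5, 5.0):
    print(w, preferred(w))
```

With a weight of 0 the codon-optimal candidate is preferred; with the weights 0.8, 1.5 and 5 the GC-balanced candidate wins, and the margin by which it wins grows with the weight.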

Example 3

The efficiency of the method of the invention is illustrated by the following exemplary embodiment in which expression constructs with adapted and RNA- and codon-optimized reading frames were prepared, and in which the respective expression of the protein was quantified.

Selected cytokine genes and chemokine genes from various organisms (human: IL15, GM-CSF and mouse: GM-CSF, MIP1alpha) were cloned into the plasmid pcDNA3.1(+) (Invitrogen) to prepare expression plasmids. The reading frames of the corresponding genes were optimized using a codon choice like that preferentially found in human and murine cells, respectively, and using the optimization method described herein for maximal expression in the relevant organism. The corresponding genes were artificially assembled after the amino acid sequence of the genes was initially translated into a nucleotide sequence like that calculated by the described method taking account of various parameters.

The optimization of the cytokine genes was based on the following parameters:

the following quality function was used to assess the test sequence:

TotScore=CUScore−GCScore−REPScore−SEKScore−SiteScore

The CDS length was 5 codons.

The individual scores are in this case defined as follows:

a) CUScore=<CU>

where <CU> represents the arithmetic mean of the relative adaptiveness values of the CDS codons, multiplied by 100. That is, to represent the codon usage of a codon, and for better comparability of the codon quality of different amino acids, the codon which is best in each case for a particular amino acid is set equal to 100, and the worse codons are rescaled according to their tabulated percentage content. A CUScore of 100 therefore means that only codons optimal for the expression system are used. In the cytokine genes to be optimized, the CUScore was calculated on the basis of the codon frequencies in humans (Homo sapiens) which are listed in the table below. Only codons whose relative adaptiveness is greater than 0.6 are used in the optimizations.

Source: GENBANK™ release 138.0 [Oct. 15, 2003] codon usage database

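| AmAcid | Codon | Frequency |
|---|---|---|
| Ala | GCG | 0.10 |
|  | GCA | 0.23 |
|  | GCT | 0.26 |
|  | GCC | 0.40 |
| Arg | AGG | 0.20 |
|  | AGA | 0.20 |
|  | CGG | 0.20 |
|  | CGA | 0.11 |
|  | CGT | 0.06 |
|  | CGC | 0.19 |
| Asn | AAT | 0.45 |
|  | AAC | 0.55 |
| Asp | GAT | 0.46 |
|  | GAC | 0.54 |
| Cys | TGT | 0.45 |
|  | TGC | 0.55 |
| End | TGA | 0.61 |
|  | TAG | 0.17 |
|  | TAA | 0.21 |
| Gln | CAG | 0.73 |
|  | CAA | 0.27 |
| Glu | GAG | 0.58 |
|  | GAA | 0.42 |
| Gly | GGG | 0.25 |
|  | GGA | 0.28 |
|  | GGT | 0.16 |
|  | GGC | 0.34 |
| His | CAT | 0.41 |
|  | CAC | 0.59 |
| Ile | ATA | 0.16 |
|  | ATT | 0.36 |
|  | ATC | 0.47 |
| Leu | TTG | 0.12 |
|  | TTA | 0.08 |
|  | CTG | 0.38 |
|  | CTA | 0.09 |
|  | CTT | 0.13 |
|  | CTC | 0.20 |
| Lys | AAG | 0.56 |
|  | AAA | 0.44 |
| Met | ATG | 1.00 |
| Phe | TTT | 0.45 |
|  | TTC | 0.55 |
| Pro | CCG | 0.11 |
|  | CCA | 0.27 |
|  | CCT | 0.28 |
|  | CCC | 0.34 |
| Ser | AGT | 0.15 |
|  | AGC | 0.24 |
|  | TCG | 0.05 |
|  | TCA | 0.15 |
|  | TCT | 0.18 |
|  | TCC | 0.22 |
| Thr | ACG | 0.11 |
|  | ACA | 0.29 |
|  | ACT | 0.24 |
|  | ACC | 0.37 |
| Trp | TGG | 1.00 |
| Tyr | TAT | 0.44 |
|  | TAC | 0.56 |
| Val | GTG | 0.45 |
|  | GTA | 0.12 |
|  | GTT | 0.18 |
|  | GTC | 0.24 |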
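The rescaling to relative adaptiveness and the cut-off of 0.6 can be sketched with three amino acids from the table above. The function and variable names are hypothetical; the frequencies are copied from the table.

```python
# Frequencies for three amino acids, taken from the table above.
HUMAN_FREQ = {
    "Ile": {"ATA": 0.16, "ATT": 0.36, "ATC": 0.47},
    "Leu": {"TTG": 0.12, "TTA": 0.08, "CTG": 0.38,
            "CTA": 0.09, "CTT": 0.13, "CTC": 0.20},
    "Phe": {"TTT": 0.45, "TTC": 0.55},
}

def relative_adaptiveness(freqs):
    """Rescale frequencies so that the best codon of an amino acid is 1.0."""
    best = max(freqs.values())
    return {codon: f / best for codon, f in freqs.items()}

def allowed_codons(freqs, cutoff=0.6):
    """Codons whose relative adaptiveness is greater than the cut-off."""
    return [c for c, w in relative_adaptiveness(freqs).items() if w > cutoff]

def cu_score(codons, weights):
    """Arithmetic mean of the relative adaptiveness values, times 100."""
    return 100.0 * sum(weights[c] for c in codons) / len(codons)
```

With these values, only CTG remains available for leucine, while isoleucine keeps ATT and ATC and phenylalanine keeps both of its codons.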

b) GCScore=|<GC>−GC_(desire)|×2

with

-   <GC>: average percentage GC content of the last 35 bases of the test sequence
-   GC_(desire): desired percentage GC content of 60%

c) REPScore=(Score_(alignment,max))

To ascertain the individual weights of the alignments (alignment score), a local alignment of a terminal part region of the test sequence which includes a maximum of the last 35 bases of the complete test sequence is carried out with the region located in front in the test sequence.

Assessment parameters used in this case for a base position are:

Match=10; Mismatch=−30; Gap=−30.

The corresponding criterion weight REPScore is defined as the highestalignment score Score_(alignment,maxt) reached in the checked region ofthe test sequence. If the value of Score_(alignment,max)) is <100, thenREPScore is set equal to 0.

d) SEKScore=(Score_(InvAlignment,max))

The criterion weight SEKScore weights inverse alignments in the sequence produced. To ascertain the individual weight of an alignment (Score_(InvAlignment,max)), a local alignment of the inverse complement of the test sequence is carried out with the part region of the test sequence which includes a maximum of the last 35 bases of the complete test sequence.

The assessment parameters used for a base position in this case are:

-   Match=10;
-   Mismatch=−30;
-   Gap=−30.

The corresponding criterion weight SEKScore is defined as the highest alignment score Score_(InvAlignment,max) reached in the checked region of the test sequence. If the value of Score_(InvAlignment,max) is <100, then SEKScore is set equal to 0.
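The SEKScore can be sketched in the same way, using the inverse complement of the test sequence as the alignment partner. As before, a Smith-Waterman alignment with linear gap penalty is assumed, and all function names are illustrative rather than taken from the original method.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Inverse complement of a DNA sequence written with A, C, G and T."""
    return seq.translate(COMPLEMENT)[::-1]

def local_alignment_score(a, b, match=10, mismatch=-30, gap=-30):
    """Best Smith-Waterman local alignment score of a against b (linear gaps)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def sek_score(test_seq, window=35, threshold=100):
    """Highest score of the terminal window against the inverse complement."""
    tail = test_seq[-window:]
    score = local_alignment_score(reverse_complement(test_seq), tail)
    return score if score >= threshold else 0

stem = "GATTACAGGCTC"
# The terminal 12 bases can pair with the stem at the start (a hairpin).
print(sek_score(stem + "T" * 30 + reverse_complement(stem)))
# A direct repeat of the stem forms no inverse complementary repeat.
print(sek_score(stem + "T" * 30 + stem))
```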

e) SITEScore

The following table lists the sequence motifs taken into account in ascertaining the SITEScore. Where a y appears under the heading “REVERSE”, both the stated sequence motif and the relevant inverse complementary sequence motif were taken into account. If an n is indicated under this heading, only the stated sequence motif, but not the sequence motif inverse complementary thereto, was taken into account. For each occurrence of the sequence motifs listed in the table (or their inverse complement if REVERSE=y) within the last 35 bases of the test sequence, the criterion weight SITEScore is increased by a value of 100 000.

NAME                                   SEQUENCE                REVERSE
KpnI                                   GGTACC                  n
SacI                                   GAGCTC                  n
Eukaria: (consensus) branch point      YTRAY                   n
Eukaria: (consensus) Splice Acceptor   YYYYYYYYYN(1, 10)AG     n
Eukaria: (consensus) Splice-Donor1     RGGTANGT                n
Eukaria: poly(A)-site (1)              AATAAA                  n
Eukaria: poly(A)-site (2)              TTTTTATA                n
Eukaria: poly(A)-site (3)              TATATA                  n
Eukaria: poly(A)-site (4)              TACAYA                  n
Eukaria: poly(A)-site (5)              TAGTAGTA                n
Eukaria: poly(A)-site (6)              ATATATTT                n
Eukaria: (consensus) Splice-Donor2     ACGTANGT                n
Eukaria: (Cryptic) Splice-Donor1       RGGTNNGT                n
BsmBI                                  CGTCTC                  y
BbsI                                   GAAGAC                  y
Eukaria: (Cryptic) Splice-Donor2       RGGTNNHT                n
Eukaria: (Cryptic) Splice-Donor3       NGGTNNGT                n
Eukaria: RNA inhib. Sequence           WWWATTTAWWW             n
GC-Stretch                             SSSSSSSSS               n
Chi-Sequence                           GCTGGTGG                y
Repeats                                RE (w(9.))/1            n
Prokaria: RBS-Entry (2)                AAGGAGN(3, 13)ATG       y
Prokaria: RBS-Entry (1)                AGGAGGN(3, 13)ATG       y
Prokaria: RBS-Entry (3)                TAASGAGGTN(3, 13)DTG    y
Prokaria: RBS-Entry (4)                AGAGAGN(3, 13)ATG       y
Prokaria: RBS-Entry (5)                AAGGAGGN(3, 13)ATG      y
Prokaria: RBS-Entry (6)                AACGGAGGN(3, 13)ATG     y
Prokaria: RBS-Entry (7)                AAGAAGGAAN(3, 13)ATG    y
HindIII                                AAGCTT                  n
NotI                                   GCGGCCGC                n
BamHI                                  GGATCC                  n
EcoRI                                  GAATTC                  n
XbaI                                   TCTAGA                  n
XhoI                                   CTCGAG                  n
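How such a motif table might be evaluated can be sketched as follows. The sketch translates IUPAC ambiguity codes into regular expressions and counts overlapping occurrences within the last 35 bases. It covers only simple motifs: the spacer notation such as N(3, 13) and the "Repeats" entry are not handled. The four-motif subset, the function names and the counting of overlapping hits are assumptions made for illustration.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement of each IUPAC code, used to build the inverse complement motif.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# (name, motif, REVERSE flag): a small subset of the table above.
MOTIFS = [
    ("KpnI", "GGTACC", False),
    ("BsmBI", "CGTCTC", True),
    ("poly(A)-site (1)", "AATAAA", False),
    ("poly(A)-site (4)", "TACAYA", False),
]

def count_hits(motif, seq):
    """Count occurrences of an IUPAC motif, including overlapping ones."""
    pattern = "(?=" + "".join(IUPAC[c] for c in motif) + ")"
    return len(re.findall(pattern, seq))

def site_score(test_seq, motifs=MOTIFS, window=35, penalty=100_000):
    """Add the penalty for every motif hit within the last `window` bases."""
    tail = test_seq[-window:].upper()
    score = 0
    for _name, motif, reverse in motifs:
        score += penalty * count_hits(motif, tail)
        if reverse:
            inverse = motif.translate(COMPLEMENT)[::-1]
            score += penalty * count_hits(inverse, tail)
    return score

# TACACA matches TACAYA; GAGACG is the inverse complement of the BsmBI site.
print(site_score("ATG" + "GCC" * 5 + "TACACA" + "GAGACG" + "GCC" * 3))
```

A palindromic motif flagged REVERSE=y would be counted twice by this sketch; whether the original method does so is not stated.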

The following sequences in the table above correspond to sequences in the attached Sequence Listing: Eukaria: (consensus) Splice Acceptor (SEQ ID NO: 24); Eukaria: RNA inhib. Sequence (SEQ ID NO: 25); Prokaria: RBS-Entry (2) (SEQ ID NO: 26); Prokaria: RBS-Entry (1) (SEQ ID NO: 27); Prokaria: RBS-Entry (3) (SEQ ID NO: 28); Prokaria: RBS-Entry (4) (SEQ ID NO: 29); Prokaria: RBS-Entry (5) (SEQ ID NO: 30); Prokaria: RBS-Entry (6) (SEQ ID NO: 31); Prokaria: RBS-Entry (7) (SEQ ID NO: 32).

Appropriate unique restriction cleavage sites were introduced for subcloning. The complete nucleotide sequences are indicated in the annex. The sequences modified in this way were prepared as fully synthetic genes (Geneart, Regensburg). The resulting coding DNA fragments were placed under the transcriptional control of the cytomegalovirus (CMV) early promoter/enhancer in the expression vector pcDNA3.1(+) using the restriction cleavage sites HindIII and NotI. To prepare expression plasmids which are analogous but unaltered in their codon choice (wild-type reference constructs), the coding regions (cDNA constructs were obtained from RZPD) were cloned after PCR amplification with appropriate oligonucleotides, likewise using the HindIII and NotI restriction cleavage sites, in pcDNA3.1(+).

To quantify cytokine/chemokine expression, human cells were transfected with the respective expression constructs, and the amount of protein in the cells and in the cell culture supernatant was measured by using commercial ELISA test kits.

All the cell culture products were from Life Technologies (Karlsruhe). Mammalian cell lines were cultivated at 37° C. and 5% CO₂. The human lung carcinoma cell line H1299 was cultivated in Dulbecco's modified Eagle medium (DMEM) with L-glutamine, D-glucose (4.5 mg/ml), sodium pyruvate, 10% inactivated fetal bovine serum, penicillin (100 U/ml) and streptomycin (100 μg/ml). The cells were subcultivated in the ratio 1:10 after reaching confluence.

2.5×10⁵ cells were seeded in 6-well cell culture dishes and, after 24 h, transfected by calcium phosphate coprecipitation (Graham and van der Eb, 1973) with 15 μg of expression plasmids or pcDNA3.1 vector (mock control). Cells and culture supernatants were harvested 48 h after the transfection. Insoluble constituents in the supernatants were removed by centrifugation at 10 000×g and 4° C. for 10 min. The transfected cells were washed twice with ice-cold PBS (10 mM Na₂HPO₄, 1.8 mM KH₂PO₄, 137 mM NaCl, 2.7 mM KCl), detached with 0.05% trypsin/EDTA, centrifuged at 300×g for 10 min and lysed in 100 μl of lysis buffer (50 mM Tris-HCl, pH 8.0, 150 mM NaCl, 0.1% SDS (w/v), 1% Nonidet P40 (v/v), 0.5% Na deoxycholate (w/v)) on ice for 30 min. Insoluble constituents of the cell lysate were removed by centrifugation at 10 000×g and 4° C. for 30 min. The total amount of protein in the cell lysate supernatant was determined using the Bio-Rad protein assay (Bio-Rad, Munich) in accordance with the manufacturer's instructions.

The specific protein concentrations in the cell lysates and cell culture supernatants were quantified by ELISA tests (BD Pharmingen for IL15 and GM-CSF; R & D Systems for MIP1alpha). Appropriate amounts of total protein of the cell lysate (0.2 to 5 μg) and dilutions of the supernatant (undiluted to 1:200) were analyzed according to the manufacturer's instructions, and the total concentration was calculated by means of a calibration plot. FIG. 9 shows a representative calibration plot for calculating the murine MIP1alpha concentration. Recombinant murine MIP1alpha was adjusted in accordance with the manufacturer's instructions by serial two-fold dilutions to increasing concentrations and employed in parallel with the samples from the cell culture experiments in the MIP1alpha specific ELISA test. The concentrations (x axis) were plotted against the measured O.D. values (450 nm, y axis), and a regression line was calculated using MS Excel (the regression coefficient R² is indicated).
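The use of such a calibration plot can be sketched with an ordinary least-squares line. The standard concentrations and O.D. values below are invented for illustration only and do not reproduce the data of FIG. 9; the function names are likewise assumptions.

```python
def fit_line(xs, ys):
    """Least-squares regression line y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

def concentration_from_od(od, slope, intercept):
    """Read a sample concentration off the calibration line."""
    return (od - intercept) / slope

# Hypothetical standards: concentration (x axis) against O.D. at 450 nm (y axis).
standards = [0.0, 50.0, 100.0, 200.0, 400.0]
od_values = [0.05, 0.15, 0.25, 0.45, 0.85]

slope, intercept, r_squared = fit_line(standards, od_values)
print(round(slope, 4), round(intercept, 3), round(r_squared, 4))
print(round(concentration_from_od(0.35, slope, intercept), 1))
```

A measured sample is converted by inverting the fitted line, which is only meaningful within the concentration range covered by the standards.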

This was supplemented by Western blot analyses of suitable samples. For GM-CSF samples, total proteins were precipitated from 1 ml of cell culture supernatant in each case by Na DOC (sodium deoxycholate) and TCA (trichloroacetic acid) and resuspended in 60 μl of 1× sample buffer (Laemmli, 1970). 20 μl were employed for each of the analyses. For IL15 detection, 25 μg of total protein from cell lysates were used. The samples were heated at 95° C. for 5 min, fractionated on a 15% SDS/polyacrylamide gel (Laemmli, 1970), electrotransferred to a nitrocellulose membrane (Bio-Rad) and analyzed with appropriate monoclonal antibodies (BD Pharmingen), detected using a secondary, AP (alkaline phosphatase)-coupled antibody and demonstrated by chromogenic staining. FIG. 12A to C show the expression analysis of the synthetic reading frames and of the wild-type reading frames. H1299 cells were transfected with the stated constructs, and the protein production was detected by conventional immunoblot analyses. In this case, FIG. 12A shows the analysis of the cell culture supernatants after Na DOC/TCA precipitation of human GM-CSF transfected H1299 cells, FIG. 12B shows the analysis of the cell culture supernatants after Na DOC/TCA precipitation of murine GM-CSF transfected H1299 cells, and FIG. 12C shows the analysis of the cell lysates from human IL15 transfected H1299 cells. Molecular weights (precision plus protein standard, Bio-Rad) and loading of the wild-type, synthetic and mock-transfected samples are indicated. Mock transfection corresponds to transfection with original pcDNA3.1 plasmid.

The following table summarizes the expression differences with averages of all ELISA-analyzed experiments. The data correspond to the percentage difference in the total amount of protein (total amount of protein in cell lysate and supernatant) related to the corresponding wild-type construct (wt corresponds to 100%).

Comparison of the total amounts of protein after transfection of wild-type vs. synthetic expression constructs

Construct   Organism   MW*    StdDev**   n=
GM-CSF      human      173%   53%        4
IL15        human      181%   37%        3
GM-CSF      mouse      127%   12%        2
MIP1alpha   mouse      146%   48%        2

*percentage average of the amount of protein from n experiments (in duplicate) related to the total amount of protein for the corresponding wild-type construct
**standard deviation
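The arithmetic behind the MW and StdDev columns can be sketched as follows. The per-experiment totals are invented for illustration, and the use of the sample standard deviation (n−1 in the denominator) is an assumption, since the text does not state which form was used.

```python
from statistics import mean, stdev

def relative_expression(synthetic_totals, wild_type_totals):
    """Mean and standard deviation of synthetic/wild-type ratios in percent."""
    ratios = [100.0 * s / w for s, w in zip(synthetic_totals, wild_type_totals)]
    return mean(ratios), stdev(ratios)

# Hypothetical total protein amounts (lysate plus supernatant) from two
# experiments, in arbitrary units.
synthetic = [150.0, 200.0]
wild_type = [100.0, 100.0]

average, deviation = relative_expression(synthetic, wild_type)
print(round(average), round(deviation, 1))
```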

FIG. 10 shows in the form of a bar diagram the relative amount of protein in relation to the respective wild-type construct (corresponds to 100%) and illustrates the percentage increase in the total amount of protein after transfection of synthetic expression constructs compared with wild-type expression constructs. H1299 cells were transfected with 15 μg of the stated cytokine/chemokine constructs. The respective protein production was quantified by conventional ELISA tests in the cell culture supernatant and in the cell lysate by means of appropriate standard plots (see FIG. 9). The ratio of the total amount of protein of synthetic to wild-type protein was calculated in each experiment (consisting of two independent mixtures) and indicated as percent of the total wild-type protein. The bars represent the average of four experiments for human GM-CSF, of three experiments for human IL15 and of two experiments for murine MIP1alpha and GM-CSF, in each case in independent duplicates. The error bars correspond to the standard deviation.

FIG. 11 depicts a representative ELISA analysis of the cell lysates and supernatants of transfected H1299 cells for human GM-CSF. H1299 cells were transfected with 15 μg each of wild-type and optimized human GM-CSF constructs. The respective protein concentration was quantified by conventional ELISA tests in the cell culture supernatant and in the cell lysate by means of appropriate standard plots. The bars represent the value of the total amount of protein in the cell lysate (CL), in the cell culture supernatant (SN) and the total of these values (total) for 2 independent mixtures in each case (1 and 2).

This analysis shows that the increase in expression after optimization (hu GM-CSF opt) is consistently detectable in the cell lysate and supernatant. It also illustrates by way of example that secretion of the cytokines is unaffected by the optimization by this method. A distinct and reproducible increase in protein expression was detectable for all optimized constructs, with the synthesis efficiencies of the optimized genes being improved by comparison with the wild-type genes in each individual experiment.

Expression was additionally checked in Western blot analyses (FIG. 12A to C). Human and murine GM-CSF were detectable in the cell culture supernatant (after Na DOC/TCA precipitation) (FIGS. 12A and B), while human IL15 was detectable in the cell lysates (FIG. 12C). The proteins were analyzed and compared with commercially available recombinant proteins (BD), and the molecular weight was correspondingly confirmed. It was not possible in these transient transfection experiments to detect murine MIP1alpha by immunoblot staining. Comparison of the wild-type with the synthetic proteins in these representative immunoblots confirms the data of the ELISA analyses of an improved protein synthesis through multiparameter optimization of these genes.

The features disclosed in the claims, the drawings and the description may be essential both singly and in any combination for implementation of the invention in its various embodiments.

Annex: SEQ-IDs and alignments of the DNA sequences used

The SEQ-ID references used herein correspond to the similarly numbered sequences in the attached Sequence Listing, e.g., “SEQ-ID1” corresponds to SEQ ID NO: 1, “SEQ-ID2” corresponds to SEQ ID NO: 2, etc.

SEQ-ID of the indicated constructs:

SEQ-ID1 (human GM-CSF wild type):
   1 atgtggctgc agagcctgct gctcttgggc actgtggcct gcagcatctc tgcacccgcc
  61 cgctcgccca gccccagcac gcagccctgg gagcatgtga atgccatcca ggaggcccgg
 121 cgtctcctga acctgagtag agacactgct gctgagatga atgaaacagt agaagtcatc
 181 tcagaaatgt ttgacctcca ggagccgacc tgcctacaga cccgcctgga gctgtacaag
 241 cagggcctgc ggggcagcct caccaagctc aagggcccct tgaccatgat ggccagccac
 301 tacaagcagc actgccctcc aaccccggaa acttcctgtg caacccagat tatcaccttt
 361 gaaagtttca aagagaacct gaaggacttt ctgcttgtca tcccctttga ctgctgggag
 421 ccagtccagg agtag

SEQ-ID2 (human GM-CSF optimized):
   1 ctgtggctgc agagcctgct gctgctggga acagtggcct gtagcatctc tgcccctgcc
  61 agaagcccta gccctagcac acagccttgg gagcacgtga atgccatcca ggaggccagg
 121 agactgctga acctgagcag agatacagcc gccgagatga acgagaccgt ggaggtgatc
 181 agcgagatgt tcgacctgca ggagcctaca tgcctgcaga cccggctgga gctgtataag
 241 cagggcctga gaggctctct gaccaagctg aagggccccc tgacaatgat ggccagccac
 301 tacaagcagc actgccctcc tacccctgag acaagctgcg ccacccagat cattaccttc
 361 gagagcttca aggagaacct gaaggacttc ctgctggtga tccccttcga ttgctgggag
 421 cccgtgcagg agtag

SEQ-ID3 (human IL15 wild type):
   1 atgagaattt cgaaaccaca tttgagaagt atttccatcc agtgctactt gtgtttactt
  61 ctaaacagtc attttctaac tgaagctggc attcatgtct tcattttggg ctgtttcagt
 121 gcagggcttc ctaaaacaga agccaactgg gtgaatgtaa taagtgattt gaaaaaaatt
 181 gaagatctta ttcaatctat gcatattgat gctactttat atacggaaag tgatgttcac
 241 cccagttgca aagtaacagc aatgaagtgc tttctcttgg agttacaagt tatttcactt
 301 gagtccggag atgcaagtat tcatgataca gtagaaaatc tgatcatcct agcaaacaac
 361 agtttgtctt ctaatgggaa tgtaacagaa tctggatgca aagaatgtga ggaactggag
 421 gaaaaaaata ttaaagaatt tttgcagagt tttgtacata ttgtccaaat gttcatcaac
 481 acttcttag

SEQ-ID4 (human IL15 optimized):
   1 atgcggatca gcaagcccca cctgaggagc atcagcatcc agtgctacct gtgcctgctg
  61 ctgaacagcc acttcctgac agaggccggc atccacgtgt ttatcctggg ctgcttctct
 121 gccggcctgc tccagagcat gcacatcgac gccaccctgt acacagagag cgacgtgcac
 181 gaggacctga tccagagcat gcacatcgac gccaccctgt acacagagag cgacgtgcac
 241 cctagctgta aggtgaccgc catgaagtgc ttcctgctgg agctgcaggt gatcagcctg
 301 gagagcggcg atgccagcat ccacgacacc gtggagaacc tgatcatcct ggccaacaac
 361 agcctgagca gcaacggcaa tgtgaccgag agcggctgca aggagtgtga ggagctggag
 421 gagaagaaca tcaaggagtt cctgcagagc ttcgtgcaca tcgtgcagat gttcatcaac
 481 accagctag

SEQ-ID5 (murine GM-CSF wild type):
   1 atgtggctgc agaatttact tttcctgggc attgtggtct acagcctctc agcacccacc
  61 cgctcaccca tcactgtcac ccggccttgg aagcatgtag aggccatcaa agaagccctg
 121 aacctcctgg atgacatgcc tgtcacattg aatgaagagg tagaagtcgt ctctaacgag
 181 ttctccttca agaagctaac atgtgtgcag acccgcctga agatattcga gcagggtcta
 241 cggggcaatt caactccgga aacggactgt gaaacacaag ttaccaccta tgcggatttc
 301 tactgccccc caactccgga aacggactgt gaaacacaag ttaccaccta tgcggatttc
 361 atagacagcc ttaaaacctt tctgactgat atcccctttg aatgcaaaaa accaggccaa
 421 aaatag

SEQ-ID6 (murine GM-CSF optimized):
   1 atgtggctgc agaacctgct gttcctgggc atcgtggtgt acagcctgag cgcccccacc
  61 aggagcccca tcaccgtgac caggccctgg aagcacgtgg aggccatcaa ggaggccctg
 121 aacctgctgg acgacatgcc cgtgaccctg aacgaggagg tggaggtggt gagcaacgag
 181 ttcagcttca agaagctgac ctgcgtgcag accaggctga agatcttcga gcagggcctg
 241 aggggcaact tcaccaagct gaagggcgcc ctgaatatga ccgccagcta ctaccagacc
 301 tactgccccc ccacccccga gaccgactgc gagacccagg tgaccaccta cgccgacttt
 361 atcgacagcc tgaagacctt cctgaccgac atccccttcg agtgcaagaa gcccggccag
 421 aagtag

SEQ-ID7 (murine MIP1alpha wild type):
   1 atgaaggtct ccaccactgc ccttgctgtt cttgtctgta ccatgacact ctgcaaccaa
  61 gtcttctcag cgccatatgg agctgacacc ccgactgcct gctgcttctc ctacagccgg
 121 aagattccac gccaattcat cgttgactat tttgaaacca gcagcctttg ctcccagcca
 181 ggtgtcattt tcctgactaa gagaaaccgg cagatctgcg ctgactccaa agagacctgg
 241 gtctaagaat acatcactga cctggaactg aatgcctag

SEQ-ID8 (murine MIP1alpha optimized):
   1 atgaaggtga gcaccagagc tctggctgtg ctgctgtgca ccatgaccct gtgcaaccag
  61 gtgttcagcg ctccttacgg cgccgatacc cctacagcct gctgcctcag ctacagcagg
 121 aagatcccca ggcagttcat cgtggactac ttcgagacca gcagcctgtg ttctcagccc
 181 ggcgtgatct tcctgaccaa gcggaacaga cagatctgcg ccgacagcaa ggagacatgg
 241 gtgcaggagt acatcaccga cctggagctg aacgcctag

Alignments of the DNA Sequences Used

1. Human GM-CSF:

Upper line: SEQ-ID1 (human GM-CSF wild type), from 1 to 435
Lower line: SEQ-ID2 (human GM-CSF optimized), from 1 to 435
Wild type: optimized identity=83.45% (363/435) gap=0.00% (0/435)

   1 ATGTGGCTGCAGAGCCTGCTGCTCTTGGGCACTGTGGCCTGCAGCATCTCTGCACCCGCC
     |||||||||||||||||||||||  |||| || |||||||| ||||||||||| || |||
   1 ATGTGGCTGCAGAGCCTGCTGCTGCTGGGAACAGTGGCCTGTAGCATCTCTCCCCCTGCC

  61 CGCTCGCCCAGCCCCAGCACGCAGCCCTGGGAGCATGTGAATGCCATCCAGGAGGCCCGG
       |    ||||||| ||||| ||||| |||||||| ||||||||||||||||||||| ||
  61 AGAAGCCCTAGCCCTAGCACACAGCCTTGGGAGCACGTGAATGCCATCCAGGAGGCCAGG

 121 CGTCTCCTGAACCTGAGTAGAGACACTGCTGCTGAGATGAATGAAACAGTAGAAGTCATC
       | ||||||||||||| ||||| || || || |||||||| || || || || || |||
 121 AGACTGCTGAACCTGAGCAGAGATACAGCCGCCGAGATGAACGAGACCGTGGAGGTGATC

 181 TCAGAAATGTTTGACCTCCAGGAGCCGACCTGCCTACAGACCCGCCTGGAGCTGTACAAG
         || |||||||||| |||||||| || ||||| |||||||| ||||||||||| |||
 181 AGCGAGATGTTCGACCTGCAGGAGCCTACATGCCTGCAGACCCGGCTGGAGCTGTATAAG

 241 CAGGGCCTGCGGGGCAGCCTCACCAAGCTCAAGGGCCCCTTGACCATGATGGCCAGCCAC
      ||||||||| ||||   || |||||||| ||||||||| |||| |||||||||||||||
 241 CAGGGCCTGAGAGGCTCTCTGACCAAGCTGAAGGGCCCCCTGACAATGATGGCCAGCCAC

 301 TACAAGCAGCACTGCCCTCCAACCCCGGAAACTTCCTGTGCAACCCAGATTATCACCTTT
     |||||||||||||||||||| ||||| || ||   ||| || |||||||| ||||||||
 301 TACAAGCAGCACTGCCCTCCTACCCCTGAGACAAGCTGCGCCACCCAGATCATCACCTTC

 361 GAAAGTTTCAAAGAGAACCTGAAGGACTTTCTGCTTGTCATCCCCTTTGACTGCTGGGAG
      || || |||||||||||||||||||||| ||||| || |||||||| || |||||||||
 361 GAGAGCTTCAAGGAGAACCTGAAGGACTTCCTGCTGGTGATCCCCTTCGATTGCTGGGAG

 421 CCAGTCCAGGAGTAG
     || || |||||||||
 421 CCCGTGCAGGAGTAG

2. Human IL15:

Upper line: SEQ-ID3 (human IL15 wild type), from 1 to 489
Lower line: SEQ-ID4 (human IL15 optimized), from 1 to 489
Wild type: optimized identity=70.55% (345/489) gap=0.00% (0/489)

   1 ATGAGAATTTCGAAACCACATTTGAGAAGTATTTCCATCCAGTGCTACTTGTGTTTACTT
      ||| |||    || || ||  |||| || ||   ||||||||||||| ||||  | ||
   1 ATGCGGATCAGCAAGCCCCACCTGAGGAGCATCAGCATCCAGTGCTACCTGTGCCTGCTG

  61 CTAAACAGTCATTTTCTAACTGAAGCTGGCATTCATGTCTTCATTTTGGGCTGTTTCAGT
      || ||||| |||| || || || || ||||| || || || ||  ||||||| |||  |
  61 CTGAACAGCCACTTCCTGACAGAGGCCGGCATCCACGTGTTTATCCTGGGCTGCTTCTCT

 121 GCAGGGCTTCCTAAAACAGAAGCCAACTGGGTGAATGTAATAAGTGATTTGAAAAAAATT
      || || ||||||| ||||| |||||||||||||| || || || ||  |||| || ||
 121 GCCGGCCTGCCTAAGACAGAGGCCAACTGGGTGAACGTGATCAGCGACCTGAAGAAGATC

 181 GAAGATCTTATTCAATCTATGCATATTGATGCTACTTTATATACGGAAAGTGATGTTCAC
      || || || ||||    ||||| || || || ||  | || || || || || || |||
 181 GAGGACCTGATCCAGAGCATGCACATCGACGCCACCCTGTACACAGAGAGCGACGTGCAC

 241 CCCAGTTGCAAAGTAACAGCAATGAAGTGCTTTCTCTTGGAGTTACAAGTTATTTCACTT
      || || || |||| || || ||||||||||| ||  ||||| | || || ||    ||
 241 CCTAGCTGTAAGGTGACCGCCATGAAGTGCTTCCTGCTGGAGCTGCAGGTGATCAGCCTG

 301 GAGTCCGGAGATGCAAGTATTCATGATACAGTAGAAAATCTGATCATCCTAGCAAACAAC
      |||  |||||||| || || || || || || || || ||||||||||| || ||||||
 301 GAGAGCGGCGATGCCAGCATCCACGACACCGTGGAGAACCTGATCATCCTGGCCAACAAC

 361 AGTTTGTCTTCTAATGGGAATGTAACAGAATCTGGATGCAAAGAATGTGAGGAACTGGAG
      ||  ||     || || ||||| || ||    || ||||| || |||||||| ||||||
 361 AGCCTGAGCAGCAACGGCAATGTGACCGAGAGCGGCTGCAAGGAGTGTGAGGAGCTGGAG

 421 GAAAAAAATATTAAAGAATTTTTGCAGAGTTTTGTACATATTGTCCAAATGTTCATCAAC
      || || || |||| || ||  ||||||| || || || || || || ||||||||||||
 421 GAGAAGAACATCAAGGAGTTCCTGCAGAGCTTCGTGCACATCGTGCAGATGTTCATCAAC

 481 ACTTCTTAG
     ||    |||
 481 ACCAGCTAG

3. Murine GM-CSF:

Upper line: SEQ-ID5 (murine GM-CSF wild type), from 1 to 426
Lower line: SEQ-ID6 (murine GM-CSF optimized), from 1 to 426
Wild type: optimized identity=80.75% (344/426) gap=0.00% (0/426)

   1 ATGTGGCTGCAGAATTTACTTTTCCTGGGCATTGTGGTCTACAGCCTCTCAGCACCCACC
     ||||||||||||||  | || ||||||||||| ||||| ||||||||    || ||||||
   1 ATGTGGCTGCAGAACCTGCTGTTCCTGGGCATCGTGGTGTACAGCCTGAGCGCCCCCACC

  61 CGCTCACCCATCACTGTCACCCGGCCTTGGAAGCATGTAGAGGCCATCAAAGAAGCCCTG
       |   |||||||| || ||| |||| |||||||| || ||||||||||| || ||||||
  61 AGGAGCCCCATCACCGTGACCAGGCCCTGGAAGCACGTGGAGGCCATCAAGGAGGCCCTG

 121 AACCTCCTGGATGACATGCCCTGTCACATTGAAGAAGAGGTAGAAGTCGTCTCTAACGAG
      ||||| |||||||||||||| || ||  ||| || ||||| || || ||    ||||||
 121 AACCTGCTGGACGACATGCCCGTGACCCTGAACGAGGAGGTGGAGGTGGTGAGCAACGAG

 181 TTCTCCTTCAAGAAGCTAACATGTGTGCAGACCCGCCTGAAGATATTCGAGCAGGGTCTA
      ||| |||||||||||| || || ||||||||| | |||||||| ||||||||||| ||
 181 TTCAGCTTCAAGAAGCTGACCTGCGTCCAGACCAGGCTGAAGATCTTCGAGCAGGGCCTG

 241 CGGGGCAATTTCACCAAACTCAAGGGCGCCTTGAACATGACAGCCAGCTACTACCAGACA
       ||||||||||||||| || ||||||||| |||||||||| |||||||||||||||||
 241 AGGGGCAACTTCACCAAGCTGAAGGGCGCCCTGAACATGACCGCCAGCTACTACCAGACC

 301 TACTGCCCCCCAACTCCGGAAACGGACTGTGAAACACAAGTTACCACCTATGCGGATTTC
      ||||||||||||| || || || ||||| || || || || |||||||| || || |||
 301 TACTGCCCCCCCACCCCCGAGACCGACTGCGAGACCCAGGTGACCACCTACGCCGAGTTC

 361 ATAGACAGCCTTAAAACCTTTCTGACTGATATCCCCTTTGAATGCAAAAAACCAGGCCAA
      || |||||||||| ||||| ||||| || |||||||| || ||||| || || |||||
 361 ATCGACAGCCTGAAGACCTTCCTGACCGACATCCCCTTCGAGTGCAAGAAGCCCGGCCAG

 421 AAATAG
     || |||
 421 AAGTAG

4. Murine MIP1alpha:

Upper line: SEQ-ID7 (murine MIP1alpha wild type), from 1 to 279
Lower line: SEQ-ID8 (murine MIP1alpha optimized), from 1 to 279
Wild type: optimized identity=78.49% (219/279) gap=0.00% (0/279)

   1 ATGAAGGTCTCCACCACTGCCCTTGCTGTTCTTCTCTGTACCATGACACTCTGCAACCAA
     ||||||||   |||||| || || ||||| || || || |||||||| || ||||||||
   1 ATGAAGGTGAGCACCACAGCTCTGGCTGTGCTGCTGTGCACCATGACCCTGTGCAACCAG

  61 GTCTTCTCAGCGCCATATGGAGCTGACACCCCGACTGCCTGCTGCTTCTCCTACAGCCGG
      || |||   |||| || || || || ||||| || |||||||||||| |||||||| ||
  61 GTGTTCAGCGCTCCTTACGGCGCCGATACCCCTACAGCCTGCTGCTTCAGCTACAGCAGG

 121 AAGATTCCACGCCAATTCATCGTTGACTATTTTGAAACCAGCAGCCTTTGCTCCCAGCCA
      ||||| ||  ||| |||||||| ||||| || || ||||||||||| || || |||||
 121 AAGATCCCCAGGCAGTTCATCGTGGACTACTTCGAGACCAGCAGCCTGTGTTCTCAGCCC

 181 GGTGTCATTTTCCTGACTAAGAGAAACCGGCAGATCTGCGCTGACTCCAAAGAGACCTGG
      || || |||||||||| ||| | ||| | ||||||||||| |||  ||| ||||| |||
 181 GGCGTGATCTTCCTGACCAAGCGGAACAGACAGATCTGCGCCGACAGCAAGGAGACATGG

 241 GTCCAAGAATACATCACTGACCTGGAACTGAATGCCTAG
      || || || |||||||| |||||||| |||||||||||
 241 GTGCAGGAGTACATCACCGACCTGGAGCTGAACGCCTAG
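The identity values quoted with each alignment are the number of matching positions divided by the alignment length. A minimal sketch for gap-free alignments of equal length is shown below, using the last block of the human GM-CSF alignment (positions 421 to 435) as input; the function names are illustrative.

```python
def match_line(upper, lower):
    """Marker line with '|' at identical positions and ' ' elsewhere."""
    return "".join("|" if u == l else " " for u, l in zip(upper, lower))

def identity(upper, lower):
    """Number of identical positions and identity in percent (no gaps)."""
    if len(upper) != len(lower):
        raise ValueError("gap-free comparison needs sequences of equal length")
    matches = sum(u == l for u, l in zip(upper, lower))
    return matches, 100.0 * matches / len(upper)

upper = "CCAGTCCAGGAGTAG"  # wild type, positions 421 to 435
lower = "CCCGTGCAGGAGTAG"  # optimized, positions 421 to 435

print(upper)
print(match_line(upper, lower))
print(lower)
matches, percent = identity(upper, lower)
print(f"identity={percent:.2f}% ({matches}/{len(upper)})")
```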

1. A method for optimizing a nucleotide sequence for the expression of a protein without modifying the amino acid sequence of the protein, which comprises the following steps carried out on a computer: generation of a first test sequence of n codons which correspond to n consecutive amino acids in the protein sequence, where n is a natural number and is less than or equal to N, the number of amino acids in the protein sequence, specification of m optimization positions in the test sequence which correspond to the position of m codons at which the occupation by a codon, relative to the test sequence, is to be optimized, where m≦n and m<N, generation of one or more further test sequences from the first test sequence by replacing at one or more of the m optimization positions a codon of the first test sequence by another codon which expresses the same amino acid, assessment of each of the test sequences with a quality function and ascertaining the test sequence which is optimal in relation to the quality function, specification of p codons of the optimal test sequence which are located at one of the m optimization positions, as result codons which form the codons of the optimized nucleotide sequence, and are not further optimized in subsequent iterations, at the positions which correspond to the position of said p codons in the test sequence, where p is a natural number and p≦m, iteration of the preceding steps, where in each iteration step the test sequence comprises the appropriate result codon at the positions which correspond to positions of specified result codons in the optimized nucleotide sequence, and the optimization positions are different from positions of result codons; and synthesis of the optimized nucleotide sequence.
 2. The method as claimed in claim 1, characterized in that in one or more iteration steps the m optimization positions of the test sequences directly follow one or more result codons which have been specified as part of the optimized nucleotide sequence.
 3. The method as claimed in claim 1, characterized in that in one or more iteration steps the p codons which are specified as result codons of the optimized nucleotide sequence are p consecutive codons.
 4. The method as claimed in claim 1, characterized in that in one iteration step test sequences with all possible codon occupations for the m optimization positions are generated from the first test sequence, and the optimal test sequence is ascertained from these test sequences.
 5. The method as claimed in claim 1, characterized by: assessment of each test sequence with a quality function, ascertaining of an extreme value within the values of the quality function for all partial sequences generated in an iteration step, specification of p codons of the test sequence which corresponds to the extremal value of the weight function as result codons at the appropriate positions, where p is a natural number and p≦m.
 6. The method as claimed in claim 5, characterized in that the quality function takes account of one or more of the following criteria: codon usage for a predefined organism, GC content, repetitive sequences, secondary structures, inverse complementary sequence repeats and sequence motifs.
 7. The method as claimed in claim 6, characterized in that the quality function is a function of various single terms which in each case assess one criterion from the following list of criteria: codon usage for a predefined organism, GC content, sequence motifs, repetitive sequences, secondary structures, inverse complementary sequence repeats.
 8. The method as claimed in claim 1, characterized in that the quality function takes account of one or more of the following criteria: exclusion of inverse complementary sequence identities of more than 20 nucleotides to the transcriptome of a predefined organism, exclusion of homology regions of more than 100 base pairs to a predefined DNA sequence, exclusion of homology regions with more than 90% similarity of the nucleotide sequence to a predefined DNA sequence.
 9. (canceled)
 10. The method as claimed in claim 1, characterized in that the step of synthesizing the optimized nucleotide sequence takes place in a device for automatic synthesis of nucleotide sequences which is controlled by the computer which optimizes the nucleotide sequence.
 11. A device for optimizing a nucleotide sequence for the expression of a protein on the basis of the amino acid sequence of the protein, which has a computer unit comprising instructions, including: instructions for generation of a first test sequence of n codons which correspond to n consecutive amino acids in the protein sequence, where n is a natural number and is less than or equal to N, the number of amino acids in the protein sequence, instructions for specification of m optimization positions in the test sequence which correspond to the position of m codons at which the occupation by a codon, relative to the test sequence, is to be optimized, where m≦n and m<N, instructions for generation of one or more further test sequences from the first test sequence by replacing at one or more of the m optimization positions a codon of the first test sequence by another codon which expresses the same amino acid, instructions for assessment of each of the test sequences with a quality function and for ascertaining the test sequence which is optimal in relation to the quality function, instructions for specification of p codons of the optimal test sequence which are located at one of the m optimization positions, as result codons which form the codons of the optimized nucleotide sequence at the positions which correspond to the positions of said p codons in the test sequence, where p is a natural number and p≦m, instructions for iteration of the steps of generation of a plurality of test functions, of assessment of the test sequences and of specification of result codons, where in each iteration step the test sequence comprises the appropriate result codon at the positions which correspond to positions of specified result codons in the optimized nucleotide sequence, and the optimization positions are different from positions of result codons.
 12. The device as claimed in claim 11, characterized by a unit for carrying out the steps of a method as claimed in claim 1.
 13. The device as claimed in claim 11, characterized by a device for automatic synthesis of nucleotide sequences which is controlled by the computer in such a way that it synthesizes the optimized nucleotide sequence.
 14. A computer program which comprises program code which can be executed by a computer and which, when it is executed on a computer, causes the computer to carry out a method as claimed in claim 1.
 15. (canceled)
 16. A computer-readable data medium on which a program as claimed in claim 14 is stored in computer-readable form.
 17-28. (canceled)
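As a reading aid, the iteration recited in claim 1 can be sketched as follows. In every step the m codon positions following the already fixed result codons are varied exhaustively, each test sequence is assessed, and the first p codons of the best test sequence are fixed. The codon table subset, the toy quality function (a plain sum of codon frequencies, with higher values taken as better) and the default values m=3 and p=1 are assumptions for illustration only; the quality function of the description combines several criterion weights.

```python
from itertools import product

# Small illustrative codon table (one-letter amino acid code to codons).
CODONS = {
    "M": ["ATG"],
    "A": ["GCC", "GCT", "GCA", "GCG"],
    "Q": ["CAG", "CAA"],
    "K": ["AAG", "AAA"],
}
# Human codon frequencies for these codons, as tabulated in the description.
FREQ = {
    "ATG": 1.00, "GCC": 0.40, "GCT": 0.26, "GCA": 0.23, "GCG": 0.10,
    "CAG": 0.73, "CAA": 0.27, "AAG": 0.56, "AAA": 0.44,
}

def toy_quality(test_seq):
    """Toy quality function (assumption): sum of codon frequencies."""
    return sum(FREQ[test_seq[i:i + 3]] for i in range(0, len(test_seq), 3))

def optimize(protein, m=3, p=1, quality=toy_quality):
    """Fix p result codons per iteration after varying m optimization positions."""
    result = []  # result codons; never changed again once specified
    while len(result) < len(protein):
        window = protein[len(result):len(result) + m]
        fixed = "".join(result)
        # All possible codon occupations of the optimization positions.
        candidates = product(*(CODONS[aa] for aa in window))
        best = max(candidates, key=lambda c: quality(fixed + "".join(c)))
        result.extend(best[:p])
    return "".join(result)

print(optimize("MAQK"))
```

Because only p of the m varied codons are fixed per step, the remaining m−p positions are reconsidered in the next iteration together with the codons that follow them.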